Hmm, when I read the file that's giving errors on its own, I also get a bunch of errors:
run1 <- iso_read_dual_inlet("~/Downloads/archive/motu/dids/_180223_1/")
Info: preparing to read 1 data files (all will be cached)...
Info: reading file '180223_1_Kiel Std test_1_ETH-1.did' from cache...
Progress: [================================================================================================] 1/1 (100%) 0s
Info: finished reading 1 files in 0.09 secs
Info: encountered 2 problems in total
# A tibble: 2 x 4
file_id type func details
<chr> <chr> <chr> <chr>
1 180223_1_Kiel Std test_… error extract_did_raw_vol… cannot locate voltage data - block 'CTwoDoublesArrayData' not found af…
2 180223_1_Kiel Std test_… error extract_did_vendor_… cannot process vendor computed data table - block 'CDualInletEvaluated…
Warning message:
Column `path` has different attributes on LHS and RHS of join
here's the file:
The above errors have made me look into the raw files themselves. I'm currently re-running rsync with md5 checksums so that I'm certain it's not a copying error. This may take some time because I have some Samba issues right now. I'll get back to this as soon as I have a response from tech support on that!
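For reference, this is roughly what I mean by checking against md5 sums (a minimal sketch in R rather than rsync; the paths are placeholders):
library(tools)
# compare md5 checksums of the copied files against the originals
# to catch any half-copied files
files <- list.files("/mnt/samba/rawdata/dids", recursive = TRUE)
src   <- md5sum(file.path("/mnt/samba/rawdata/dids", files))
local <- md5sum(file.path("/home/japhir/archive/dids", files))
files[which(is.na(local) | src != local)]  # anything listed here needs re-copying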
Yep, seems like this was an issue with half-copied files, since it does work now.
I still get this "Column `path` has different attributes on LHS and RHS of join" warning, though.
Thanks for testing so carefully. Could you send me a small example file and code to reproduce the warning? Some of it might be due to recent changes in dplyr (version 1.0 is coming up fast and will likely break a few things).
Unzip that file to wherever, then iso_read_dual_inlet("~/Download/folder") should do it, I think!
I get the following output without any other warnings. Could you share your sessionInfo()?
Info: preparing to read 1 data files (all will be cached)...
Info: reading file '180223_1_Kiel Std test_1_ETH-1.did' with '.did' reader
Warning: caught error - cannot locate voltage data - block 'CTwoDoublesArrayData' not found after position 1 (pos 113311)
Warning: caught error - cannot process vendor computed data table - block 'CDualInletEvaluatedData' not found after position 1 (pos 4...
Progress: [=============================================================================================================] 1/1 (100%) 1s
Info: finished reading 1 files in 1.87 secs
Info: encountered 2 problems in total
# A tibble: 2 x 4
file_id type func details
<chr> <chr> <chr> <chr>
1 180223_1_Kiel Std test_1_… error extract_did_raw_volta… cannot locate voltage data - block 'CTwoDoublesArrayData' not found after posit…
2 180223_1_Kiel Std test_1_… error extract_did_vendor_da… cannot process vendor computed data table - block 'CDualInletEvaluatedData' not…
Dual inlet iso file '180223_1_Kiel Std test_1_ETH-1.did': 0 cycles, 0 ions ()
Problems:
# A tibble: 2 x 4
file_id type func details
<chr> <chr> <chr> <chr>
1 180223_1_Kiel Std test_1_… error extract_did_raw_volta… cannot locate voltage data - block 'CTwoDoublesArrayData' not found after posit…
2 180223_1_Kiel Std test_1_… error extract_did_vendor_da… cannot process vendor computed data table - block 'CDualInletEvaluatedData' not…
Looks like there are some issues with the old caf files now, which is very unfortunate. They are suddenly ALL marked as problematic files. When I run iso_get_file_info() on my whole set of cafs I get the following:
> iso_get_file_info(cafs)
Info: aggregating file info from 4928 data file(s)
Error: No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/vctrs_error_incompatible_type>
No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
Backtrace:
1. isoreader::iso_get_file_info(cafs)
14. vctrs:::vec_ptype2.POSIXt.default(...)
15. vctrs::vec_default_ptype2(x, y, x_arg = x_arg, y_arg = y_arg)
16. vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
17. vctrs:::stop_incompatible(...)
18. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/vctrs_error_incompatible_type>
No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
Backtrace:
█
1. ├─isoreader::iso_get_file_info(cafs)
2. │ └─`%>%`(...)
3. │ ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
4. │ └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
5. │ └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
6. │ └─isoreader:::`_fseq`(`_lhs`)
7. │ └─magrittr::freduce(value, `_function_list`)
8. │ ├─base::withVisible(function_list[[k]](value))
9. │ └─function_list[[k]](value)
10. │ └─isoreader:::safe_bind_rows(.)
11. │ └─vctrs::vec_rbind(...)
12. ├─vctrs:::vec_ptype2_dispatch_s3(x = x, y = y, x_arg = x_arg, y_arg = y_arg)
13. ├─vctrs::vec_ptype2.POSIXt(x = x, y = y, x_arg = x_arg, y_arg = y_arg)
14. └─vctrs:::vec_ptype2.POSIXt.default(...)
15. └─vctrs::vec_default_ptype2(x, y, x_arg = x_arg, y_arg = y_arg)
16. └─vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
17. └─vctrs:::stop_incompatible(...)
18. └─vctrs:::stop_vctrs(...)
Ok, I think I'm being stupid. I had this issue for my newest files first, and it was fixed after I rsync'd without the --ignore-existing flag. Now I had it for the older caf files, but hadn't removed the flag there yet. Please don't spend time trying to fix this yet ;-).
Sounds good. I do think there are some dplyr issues with 0.8.5 (and the upcoming 1.0) that we need to address. The newest dplyr implements bind_rows() in a new way that I'm pretty sure is crashing the iso_get_ functions for more complicated data columns like file_datetime.
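For illustration only (a hypothetical two-tibble sketch, not isoreader's internal code), binding a proper datetime column against a bare integer column reproduces that kind of incompatible-type failure under the new vctrs-based bind_rows():
library(dplyr)   # dplyr >= 1.0 routes bind_rows() through vctrs type checks
library(tibble)
a <- tibble(file_id = "a", file_datetime = as.POSIXct("2017-01-26", tz = "Europe/Amsterdam"))
b <- tibble(file_id = "b", file_datetime = 1485385200L)
# errors with an incompatible-type message much like the one reported above
bind_rows(a, b)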
Aww, unfortunately that was not the problem. None of my old caf files work anymore, even after double-checking that they were copied over correctly.
So iso_get_file_info() breaks for the caf files. For the did files it's just become very slow.
Also, just using iso_read_dual_inlet() on one of these summary rds files is very slow, so probably some of the integrity checks have also broken? With read_rds() or readRDS() it's much faster.
I cannot reproduce your error, even with your versions of dplyr and vctrs. Could this be an issue with the cached files? Can you run an example with read_cache = FALSE and quiet = FALSE so I can get a better sense of the output?
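Something along these lines should do it (a sketch; the path is the one from your earlier post, and iso_get_problems() just lists the collected problems afterwards):
library(isoreader)
# re-read the problematic file, bypassing the cache, with all messages on
run1 <- iso_read_dual_inlet(
  "~/Downloads/archive/motu/dids/_180223_1/",
  read_cache = FALSE,
  quiet = FALSE
)
iso_get_problems(run1)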
Hmm, that's very weird. I've just updated my system and vctrs, dplyr, and isoreader, and even with this single caf file I get issues. Are you saying you don't get these warnings/errors with the one file on your system either? Or just not the error related to the different file_datetime formats?
Also: I edited all the above posts to use <details> tags so the long logs are collapsed by default.
Maybe it's because you ran it on the did files in https://github.com/isoverse/isoreader/issues/110#issuecomment-608047716 instead of on the single caf file?
found it, it's tibble 3.0!!
Hi @japhir, can you try whether devtools::install_github("isoverse/isoreader", ref = "dev") solves the problem?
That's great! Thanks for implementing a fix so soon. I've updated to the dev version again for now, and it appears to be working: it no longer gives me the LHS and RHS warning when I read in the files, but the old summary files and cached files are still very slow. It looks like I'm going to have to re-import all (caf) files again with the read_cache flag off, because applying iso_get_file_info() to the cached versions still results in an error. That'll take a while, so I'll get back to you on that later.
It just finished re-reading the 4928 caf files! It has now found 1376 files with problems, many of which are duplicate files. I got the warnings below when I saved the aggregate file to rds with iso_save():
Warning messages:
1: In max(.data$pos) : no non-missing arguments to max; returning -Inf
2: In max(.data$pos) : no non-missing arguments to max; returning -Inf
3: Unknown or uninitialised column: `block`.
4: Unknown or uninitialised column: `block`.
5: Unknown or uninitialised column: `block`.
Reading in the newly created summary rds file is still slow (20.43 secs! vs read_rds, which is basically instantaneous). iso_get_file_info() unfortunately still fails on the caf files :cry:.
Of course I should have just limited it to the two files that are actually named in the error message. That would have saved me 2 hours of unnecessary computation ;-). Anyway, here they are: problematic_2_files.zip
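In hindsight, something like this would have pulled them out directly (a sketch, assuming iso_filter_files() can filter on file_id for this collection; the output path is a placeholder):
bad_ids <- c(
  "170126_170124_Sibren_run29-1426.caf",
  "170621_170522_Guido_Magda_ETH-1-0000.caf"
)
cafs %>%
  iso_filter_files(file_id %in% bad_ids) %>%
  iso_save("out/problematic_2_files.di.rds")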
This seems resolved now, also in the master branch! Read/save speeds of the raw files are back to what they were before, and I don't get errors on saving the rds! Running iso_read_dual_inlet() on the saved rds file is still slower than a plain read_rds(), however.
Hi @japhir. The whole cache file system was actually revamped in yesterday's release (1.2.0) so that cache files can be copied, have more useful file names (so you know which is which), and allow skipping the data integrity checks for files that are up to date (which should make reading .rds files similarly fast to a direct readRDS). You do need to re-generate your cache, but that's easy now with iso_reread_outdated_files(iso_files), and hopefully this is the last time in a long while that we need to make structural changes like this. Would love to hear if it works.
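In case it helps, the re-generation step I have in mind looks roughly like this (a sketch; the path is a placeholder and the function name is as above):
library(isoreader)
# re-read only the files whose cache is outdated, then save the collection again
iso_files <- iso_read_dual_inlet("out/dids.di.rds")
iso_files <- iso_reread_outdated_files(iso_files)
iso_save(iso_files, "out/dids.di.rds")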
By the way, notifications about isoverse are now in a repo for this purpose; take a look: https://github.com/isoverse/news/issues/2
Hi @sebkopf, thanks for the notice. I've just updated to R 4.0 and the newest isoreader, but I think something must have gone wrong somewhere… re-reading the whole database took about twice as long as last time (with very few new files, as you can imagine), and while iso_get_file_info() works, it's also much slower than before. iso_get_raw_data() hasn't finished as I'm typing this...
How can I help debug this?
Hi @japhir, can you send a small excerpt of your entire collection? Nothing has changed in iso_get_file_info(), but I have not yet tested any of this in R 4.0. I think the new R does a lot more type cast checks, which might make those built into iso_get_file_info() redundant (and also makes it slower in R 4.0 since they're essentially done twice). The changes to the caching should just make reading cached files and .rds files faster, not inherently the data collection. The speed of iso_get_raw_data() is mostly limited by how quickly the tidyverse functions work (unless you also bring in all the file info), and with the switch to vctrs some things have gotten slower :( - not sure yet whether the benefit of the type cast checks in vctrs really outweighs the speed losses.
Just finished reading in everything. I didn't get any particular warnings on the newer did files, but got these on the caf files (again, much slower than before):
Info: exporting data from 4928 iso_files into R Data Storage '/home/japhir/SurfDrive/PhD/programming/dataprocessing/out/cafs.di.rds'
Warning messages:
1: In max(.data$pos) : no non-missing arguments to max; returning -Inf
2: In max(.data$pos) : no non-missing arguments to max; returning -Inf
3: Unknown or uninitialised column: `block`.
4: Unknown or uninitialised column: `block`.
5: Unknown or uninitialised column: `block`.
All of the previously shared files in this thread should be good; the raw data haven't changed. How big a subset were you thinking of? I was hesitant to share many earlier, but I just asked my supervisor and he says it shouldn't be a problem to share some files.
That's great! I was actually thinking not of the raw files, since they don't cause trouble for me, but just of parts of the isofile collection, so something like this:
iso_files <- iso_read_dual_inlet("....rds")
# pick 100 random files from the collection
iso_files[sample(1:length(iso_files), 100)] %>% iso_save("for_seb.di.rds")
As for that .data$pos warning, could you see if you can pinpoint where it occurs with the following flags, which elevate warnings to errors and don't catch them?
options(warn = 2)
isoreader:::iso_turn_debug_on(catch_errors = FALSE)
iso_read_dual_inlet(....)
Ok @sebkopf, here's the test file with 100 standards!
I generated them like this:
seb_sub <- dids %>%
  iso_filter_files(Comment == "STD") # for standards
# evenly spaced throughout the record; not sure if it's sorted by file_datetime though,
# so it could still be random.
seb_sub <- seb_sub[floor(seq(1, length(seb_sub), length.out = 100))] %>%
  iso_save("out/for_seb.di.rds")
I tried to have a look at where it's getting slow with profvis, but I don't really understand the graph so I'll leave that up to you ;-)
library(profvis)
library(isoreader)
profvis({
  dids <- iso_read_dual_inlet("out/for_seb.di.rds")
  didinfo <- dids %>%
    iso_get_file_info()
  rawdata <- dids %>%
    iso_get_raw_data()
})
Regarding the debugging request: this doesn't work because of the duplicated files.
options(warn = 2)
isoreader:::iso_turn_debug_on(catch_errors = FALSE)
setwd("~/Documents/archive/")
isoreader::iso_read_dual_inlet("~/Documents/archive/pacman/cafs",
                               discard_duplicates = FALSE)
Having to work from home got me quite frustrated with the extremely slow VPN connection I have to the raw-data samba drive, so I copied everything over with some nice rsync scripts. I used the -t flag in rsync, which is supposed to preserve modification times. This seems to have gone wrong, however.
I did manage to read in all the data, but when I try iso_get_file_info() for all ~15k files, it results in the errors below.
Running any of the other isoreader functions is also extremely slow: just reading in the 104 MB rds file with dids <- iso_read_dual_inlet("out/dids.di.rds") takes ~2.11 minutes, probably because it's performing some checks? read_rds("out/dids.di.rds") takes approximately 7 seconds. iso_filter_files() is also non-functional on the whole dataset.
Any ideas on how to fix this?