isoverse / isoreader

Read IRMS (Isotope Ratio Mass Spectrometry) data files into R
http://isoreader.isoverse.org
GNU General Public License v2.0

reading in many (old) files after copying them over to my linux drive fails #110

Closed japhir closed 4 years ago

japhir commented 4 years ago

Having to work from home got me quite frustrated with the extremely slow vpn connection I have to the rawdata samba drive, so I copied everything over with some nice rsync scripts. I used the -t flag in rsync, which is supposed to preserve modification times. This seems to have gone wrong however:

I did manage to read in all the data, but when I run iso_get_file_info() on all ~15k files, I get the errors below:

> iso_get_file_info(dids)
Info: aggregating file info from 14985 data file(s)
Error: No common type for `180223_1_Kiel Std test_1_ETH-1.did$file_datetime` <datetime<Europe/Amsterdam>> and `180307_2_Kiel Std test_30_ETH-3.did$file_datetime` <integer>.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/vctrs_error_incompatible_type>
No common type for `180223_1_Kiel Std test_1_ETH-1.did$file_datetime` <datetime<Europe/Amsterdam>> and `180307_2_Kiel Std test_30_ETH-3.did$file_datetime` <integer>.
Backtrace:
  1. utils:::.ess.eval(...)
 27. vctrs:::vec_ptype2.POSIXt.default(...)
 28. vctrs::vec_default_ptype2(x, y, x_arg = x_arg, y_arg = y_arg)
 29. vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
 30. vctrs:::stop_incompatible(...)
 31. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.
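One way to narrow down an error like this is to list the class of `file_datetime` for every file and look at the files where it is not a date-time. The sketch below is self-contained and uses a mock list in place of the real `dids` object. It assumes each element keeps its metadata in a `file_info` element with a `file_datetime` field; that layout is an assumption made for illustration, not documented isoreader API.

```r
# Mock stand-in for an iso file list: each element carries a `file_info`
# list with a `file_datetime` field (assumed layout, for illustration only).
mock_files <- list(
  "a.did" = list(file_info = list(
    file_datetime = as.POSIXct("2018-02-23 10:00:00", tz = "UTC")
  )),
  "b.did" = list(file_info = list(file_datetime = NA_integer_))
)

# Record the first class of file_datetime for every file.
datetime_class <- vapply(
  mock_files,
  function(f) class(f$file_info$file_datetime)[1],
  character(1)
)

# Names of the files whose file_datetime is not a date-time.
suspect_files <- names(datetime_class)[datetime_class != "POSIXct"]
print(suspect_files)
```

Re-reading only the files flagged this way is much cheaper than re-reading the whole collection.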

Running any of the other isoreader functions is also extremely slow: just reading in the 104MB rds file with dids <- iso_read_dual_inlet("out/dids.di.rds") takes ~2.11 minutes, probably because it's performing some checks? read_rds("out/dids.di.rds") takes approximately 7 seconds.

iso_filter_files() is also non-functional on the whole dataset.

Any ideas on how to fix this?

japhir commented 4 years ago

Hmm when I read the file that's giving errors separately, I also get a bunch of errors:

run1 <- iso_read_dual_inlet("~/Downloads/archive/motu/dids/_180223_1/")
Info: preparing to read 1 data files (all will be cached)...
Info: reading file '180223_1_Kiel Std test_1_ETH-1.did' from cache...
Progress: [================================================================================================] 1/1 (100%)  0s
Info: finished reading 1 files in 0.09 secs
Info: encountered 2 problems in total
# A tibble: 2 x 4
  file_id                  type  func                 details                                                                
  <chr>                    <chr> <chr>                <chr>                                                                  
1 180223_1_Kiel Std test_… error extract_did_raw_vol… cannot locate voltage data - block 'CTwoDoublesArrayData' not found af…
2 180223_1_Kiel Std test_… error extract_did_vendor_… cannot process vendor computed data table - block 'CDualInletEvaluated…

Warning message:
Column `path` has different attributes on LHS and RHS of join 

here's the file:

180223_1_Kiel Std test_1_ETH-1.zip

japhir commented 4 years ago

The above errors have made me look into the raw files themselves. Currently trying to do an rsync with md5-sums so that I'm certain that it's not a copying error. This may take some time because I have some samba issues now. I'll get back to this as soon as I have a response from tech support on that!
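A checksum comparison can also be done from within R once both locations are reachable. This is a minimal sketch using `tools::md5sum()` from base R; `src_dir` and `dst_dir` are hypothetical paths, filled here with small mock files so that the example runs on its own.

```r
# Compare MD5 checksums of same-named files in two directories.
# `src_dir` and `dst_dir` are hypothetical; mock files keep this runnable.
src_dir <- file.path(tempdir(), "md5_check_src")
dst_dir <- file.path(tempdir(), "md5_check_dst")
dir.create(src_dir, showWarnings = FALSE)
dir.create(dst_dir, showWarnings = FALSE)
writeLines("complete file", file.path(src_dir, "ok.did"))
writeLines("complete file", file.path(dst_dir, "ok.did"))
writeLines("complete file", file.path(src_dir, "partial.did"))
writeLines("compl", file.path(dst_dir, "partial.did"))  # simulated half-copied file

files <- list.files(src_dir)
src_md5 <- unname(tools::md5sum(file.path(src_dir, files)))
dst_md5 <- unname(tools::md5sum(file.path(dst_dir, files)))

# Files whose checksum differs, or that are missing at the destination (NA).
mismatched <- files[is.na(dst_md5) | src_md5 != dst_md5]
print(mismatched)
```

Only the files listed in `mismatched` would then need to be copied again.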

japhir commented 4 years ago

Yep, seems like this was an issue with half-copied files, since it does work now.

I still get this Column `path` has different attributes on LHS and RHS of join warning though

sebkopf commented 4 years ago

Thanks for testing so carefully. Could you send me a small example file and code to reproduce the warning? Some of it might be due to recent changes in dplyr (version 1.0 is coming up fast and will likely break a few things).

On Thu, Apr 2, 2020 at 6:49 AM Ilja Kocken wrote:

> Yep, seems like this was an issue with half-copied files, since it does work now.
>
> I still get this Column `path` has different attributes on LHS and RHS of join warning though


japhir commented 4 years ago

180223_1_Kiel Std test_1_ETH-1.zip

Unzip that file to wherever, then iso_read_dual_inlet("~/Download/folder") should do it, I think!

sebkopf commented 4 years ago

I get the following output without any other warnings. Could you share your sessionInfo()?

Info: preparing to read 1 data files (all will be cached)...                                                                            
Info: reading file '180223_1_Kiel Std test_1_ETH-1.did' with '.did' reader                                                              
Warning: caught error - cannot locate voltage data - block 'CTwoDoublesArrayData' not found after position 1 (pos 113311)               
Warning: caught error - cannot process vendor computed data table - block 'CDualInletEvaluatedData' not found after position 1 (pos 4...
Progress: [=============================================================================================================] 1/1 (100%)  1s
Info: finished reading 1 files in 1.87 secs
Info: encountered 2 problems in total
# A tibble: 2 x 4
  file_id                    type  func                   details                                                                         
  <chr>                      <chr> <chr>                  <chr>                                                                           
1 180223_1_Kiel Std test_1_… error extract_did_raw_volta… cannot locate voltage data - block 'CTwoDoublesArrayData' not found after posit…
2 180223_1_Kiel Std test_1_… error extract_did_vendor_da… cannot process vendor computed data table - block 'CDualInletEvaluatedData' not…

Dual inlet iso file '180223_1_Kiel Std test_1_ETH-1.did': 0 cycles, 0 ions () 
Problems:
# A tibble: 2 x 4
  file_id                    type  func                   details                                                                         
  <chr>                      <chr> <chr>                  <chr>                                                                           
1 180223_1_Kiel Std test_1_… error extract_did_raw_volta… cannot locate voltage data - block 'CTwoDoublesArrayData' not found after posit…
2 180223_1_Kiel Std test_1_… error extract_did_vendor_da… cannot process vendor computed data table - block 'CDualInletEvaluatedData' not…

japhir commented 4 years ago

log of running it on one file, quietly, with caching

``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(isoreader)
#>
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#>
#>     filter
setwd("~/Downloads")
cafs <- iso_read_dual_inlet("170126_170124_Sibren_run29-1426.caf", cache = TRUE, quiet = TRUE, discard_duplicates = FALSE, parallel = TRUE)
#> Warning: Column `path` has different attributes on LHS and RHS of join
iso_get_problems(cafs)
#> # A tibble: 1 x 4
#>   file_id            type  func               details
#>
#> 1 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value` …
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS: /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#>  [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] isoreader_1.1.4 dplyr_0.8.5
#>
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4 Rcpp_1.0.4 pillar_1.4.3 compiler_3.6.3
#>  [5] highr_0.8 prettyunits_1.1.1 progress_1.2.2 R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2 base64enc_0.1-3 tools_3.6.3 digest_0.6.25
#> [13] rhdf5_2.30.1 lubridate_1.7.4 evaluate_0.14 lifecycle_0.2.0
#> [17] tibble_3.0.0 pkgconfig_2.0.3 rlang_0.4.5 openxlsx_4.1.4
#> [21] cli_2.0.2 yaml_2.2.1 parallel_3.6.3 xfun_0.12
#> [25] xml2_1.2.5 stringr_1.4.0 knitr_1.28 vctrs_0.2.4
#> [29] globals_0.12.5 hms_0.5.3 tidyselect_1.0.0 glue_1.3.2
#> [33] listenv_0.8.0 R6_2.4.1 fansi_0.4.1 rmarkdown_2.1
#> [37] tidyr_1.0.2 Rhdf5lib_1.8.0 readr_1.3.1 purrr_0.3.3
#> [41] magrittr_1.5 feather_0.3.5 codetools_0.2-16 ellipsis_0.3.0
#> [45] htmltools_0.4.0 assertthat_0.2.1 future_1.16.0 UNF_2.0.6
#> [49] utf8_1.1.4 stringi_1.4.6 crayon_1.3.4 R.oo_1.23.0
```

Created on 2020-04-03 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)

japhir commented 4 years ago

170126_170124_Sibren_run29-1426.zip

japhir commented 4 years ago

Looks like there are some issues with the old caf files now, which is very unfortunate. They are suddenly ALL marked as problematic files. When I run iso_get_file_info() on my whole set of cafs I get the following:

> iso_get_file_info(cafs)
Info: aggregating file info from 4928 data file(s)
Error: No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/vctrs_error_incompatible_type>
No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
Backtrace:
  1. isoreader::iso_get_file_info(cafs)
 14. vctrs:::vec_ptype2.POSIXt.default(...)
 15. vctrs::vec_default_ptype2(x, y, x_arg = x_arg, y_arg = y_arg)
 16. vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
 17. vctrs:::stop_incompatible(...)
 18. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.

> rlang::last_trace()
<error/vctrs_error_incompatible_type>
No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
Backtrace:
     █
  1. ├─isoreader::iso_get_file_info(cafs)
  2. │ └─`%>%`(...)
  3. │   ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  4. │   └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  5. │     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  6. │       └─isoreader:::`_fseq`(`_lhs`)
  7. │         └─magrittr::freduce(value, `_function_list`)
  8. │           ├─base::withVisible(function_list[[k]](value))
  9. │           └─function_list[[k]](value)
 10. │             └─isoreader:::safe_bind_rows(.)
 11. │               └─vctrs::vec_rbind(...)
 12. ├─vctrs:::vec_ptype2_dispatch_s3(x = x, y = y, x_arg = x_arg, y_arg = y_arg)
 13. ├─vctrs::vec_ptype2.POSIXt(x = x, y = y, x_arg = x_arg, y_arg = y_arg)
 14. └─vctrs:::vec_ptype2.POSIXt.default(...)
 15.   └─vctrs::vec_default_ptype2(x, y, x_arg = x_arg, y_arg = y_arg)
 16.     └─vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
 17.       └─vctrs:::stop_incompatible(...)
 18.         └─vctrs:::stop_vctrs(...)

japhir commented 4 years ago

Ok I think I'm being stupid. I had this issue for my newest files first, then it was fixed after I rsync'd without the --ignore-existing flag. Now I had it for the older caf files, but hadn't removed the flag yet. Don't spend time trying to fix this yet please ;-).

sebkopf commented 4 years ago

Sounds good. I do think there might be some dplyr issues with 0.8.5 (and the upcoming 1.0) that we need to address. The newest dplyr implements bind_rows() in a new way that I'm pretty sure is crashing the iso_get_ functions for more complicated data columns like file_datetime.
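The backtrace earlier in this thread passes through `vctrs::vec_rbind()`, and vctrs is strict about combining types. The sketch below is an illustration of that strictness only, not isoreader's actual code: it tries to combine a date-time with an integer using `vctrs::vec_c()` and records whether the incompatible-type error is raised. It is guarded so that it also runs when vctrs is not installed.

```r
dt <- as.POSIXct("2017-01-26 20:29:47", tz = "UTC")

# vctrs refuses to combine a date-time with an integer; capture the outcome.
combine_result <- if (requireNamespace("vctrs", quietly = TRUE)) {
  tryCatch(
    {
      vctrs::vec_c(dt, 1L)
      "combined"
    },
    vctrs_error_incompatible_type = function(e) "incompatible type"
  )
} else {
  "vctrs not installed"
}
print(combine_result)
```

If a single file ends up with an integer `file_datetime`, any row-binding step built on these rules will stop in the same way, which matches the `No common type` message reported above.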

japhir commented 4 years ago

Aww unfortunately that was not the problem. All my old caf files don't work anymore, even after double-checking that they were copied over correctly.

So iso_get_file_info() breaks for the caf files. For the did files it's just become very slow.

log of reading in the combined big cafs file with all 4928 caf files, resulting in 6300 errors

``` r
library(isoreader)
#>
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#>
#>     filter
setwd("~/SurfDrive/PhD/programming/dataprocessing")
cafs <- iso_read_dual_inlet("out/cafs.di.rds")
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file 'out/cafs.di.rds' with '.di.rds' reader
#> Info: loaded data for 4928 data files from R Data Storage - checking loaded...
#> Info: finished reading 1 files in 13.08 secs
#> Warning: Column `path` has different attributes on LHS and RHS of join
#> Info: encountered 6300 problems in total
#> # A tibble: 6,300 x 4
#>    file_id            type  func               details
#>
#>  1 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  2 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  3 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  4 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  5 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  6 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  7 170127_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  8 170127_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  9 170127_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#> 10 170127_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#> # … with 6,290 more rows
iso_get_file_info(cafs)
#> Info: aggregating file info from 4928 data file(s)
#> Error: No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
iso_get_raw_data(cafs)
#> Info: aggregating raw data from 4928 data file(s)
#> # A tibble: 74,674 x 9
#>    file_id                type   cycle v44.mV v45.mV v46.mV v47.mV v48.mV v49.mV
#>
#>  1 170126_170124_Sibren_… stand…     0 13091. 15594. 18622.  2078.   216. -0.528
#>  2 170126_170124_Sibren_… stand…     1 12177. 14506. 17323.  1933.   201. -0.503
#>  3 170126_170124_Sibren_… stand…     2 11329. 13497. 16119.  1799.   187. -0.456
#>  4 170126_170124_Sibren_… stand…     3 10556. 12576. 15019.  1677.   174. -0.431
#>  5 170126_170124_Sibren_… stand…     4  9845. 11729. 14008.  1564.   163. -0.397
#>  6 170126_170124_Sibren_… stand…     5  9192. 10952. 13080.  1461.   152. -0.363
#>  7 170126_170124_Sibren_… stand…     6  8591. 10236. 12224.  1366.   142. -0.339
#>  8 170126_170124_Sibren_… stand…     7  8034.  9572. 11431.  1278.   133. -0.308
#>  9 170126_170124_Sibren_… stand…     8  7521.  8961. 10702.  1197.   124. -0.282
#> 10 170126_170124_Sibren_… sample     1 12661. 14953. 17854.  1974.   206. -0.509
#> # … with 74,664 more rows
```

Created on 2020-04-03 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)

japhir commented 4 years ago

Also, just using iso_read_dual_inlet() on one of these summary rds files is very slow, so probably some of the integrity checks have also broken? With read_rds() or readRDS() it's much faster.
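To put numbers on a comparison like this, both reads can be wrapped in `system.time()`. The sketch below times `readRDS()` on a small mock file so that it runs without any real data; the isoreader call is left as a comment because it needs the package and an actual collection file.

```r
# Write a mock object to a temporary .rds file and time how long readRDS() takes.
mock_path <- tempfile(fileext = ".rds")
saveRDS(replicate(100, rnorm(1000), simplify = FALSE), mock_path)

plain_time <- system.time(mock_data <- readRDS(mock_path))[["elapsed"]]
print(plain_time)

# With isoreader installed and a real collection file, the same wrapper can be
# put around the reader call used in this thread, for example:
# iso_time <- system.time(dids <- iso_read_dual_inlet("out/dids.di.rds"))[["elapsed"]]
```

Comparing the two elapsed times on the same file shows how much of the total is spent outside plain deserialization.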

sebkopf commented 4 years ago

I cannot reproduce your error even with your versions of dplyr and vctrs. Could this be an issue with the cached files? Can you run an example with read_cache = FALSE and quiet = FALSE so I can get a better sense of the output?

japhir commented 4 years ago

Hmm that's very weird. I've just updated my system and vctrs, dplyr and isoreader, and even with this single caf file I get issues. Are you saying you don't get these warnings/errors with the one file on your system either? Or just not the error related to the different file_datetime formats?

new log running it without caching for one file

``` r
library(isoreader)
#>
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#>
#>     filter
cafs <- iso_read_dual_inlet("~/Downloads/170126_170124_Sibren_run29-1426.caf", cache = FALSE, read_cache = FALSE, quiet = FALSE, discard_duplicates = FALSE, parallel = FALSE)
#> Info: preparing to read 1 data files...
#> Info: reading file '170126_170124_Sibren_run29-1426.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: finished reading 1 files in 4.06 secs
#> Warning: Column `path` has different attributes on LHS and RHS of join
#> Info: encountered 1 problems in total
#> # A tibble: 1 x 4
#>   file_id            type  func               details
#>
#> 1 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value` …
iso_get_problems(cafs)
#> # A tibble: 1 x 4
#>   file_id            type  func               details
#>
#> 1 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value` …
iso_get_file_info(cafs)
#> Info: aggregating file info from 1 data file(s)
#> # A tibble: 1 x 7
#> file_id file_root file_path file_subpath file_datetime file_size
#>
#> 1 170126… /home/ja… 170126_1… 2017-01-26 20:29:47 651810
#> # … with 1 more variable: MS_integration_time.s
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS: /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#>  [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] isoreader_1.1.4
#>
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4 Rcpp_1.0.4 pillar_1.4.3 compiler_3.6.3
#>  [5] highr_0.8 prettyunits_1.1.1 progress_1.2.2 R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2 base64enc_0.1-3 tools_3.6.3 digest_0.6.25
#> [13] rhdf5_2.30.1 lubridate_1.7.8 evaluate_0.14 lifecycle_0.2.0
#> [17] tibble_3.0.0 pkgconfig_2.0.3 rlang_0.4.5 openxlsx_4.1.4
#> [21] cli_2.0.2 yaml_2.2.1 parallel_3.6.3 xfun_0.12
#> [25] xml2_1.3.0 dplyr_0.8.5 stringr_1.4.0 knitr_1.28
#> [29] generics_0.0.2 vctrs_0.2.4 globals_0.12.5 hms_0.5.3
#> [33] tidyselect_1.0.0 glue_1.4.0 listenv_0.8.0 R6_2.4.1
#> [37] fansi_0.4.1 rmarkdown_2.1 tidyr_1.0.2 Rhdf5lib_1.8.0
#> [41] readr_1.3.1 purrr_0.3.3 magrittr_1.5 feather_0.3.5
#> [45] codetools_0.2-16 ellipsis_0.3.0 htmltools_0.4.0 assertthat_0.2.1
#> [49] future_1.16.0 UNF_2.0.6 utf8_1.1.4 stringi_1.4.6
#> [53] crayon_1.3.4 R.oo_1.23.0
```

Created on 2020-04-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)

japhir commented 4 years ago

Another long log running it for 22 caf files, resulting in 24 errors

``` r
library(isoreader)
#>
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#>
#>     filter
dir.create("/tmp/rtmp")
setwd("/tmp/rtmp")
cafs <- iso_read_dual_inlet("~/Documents/archive/pacman/cafs/180522_Stds/", cache = FALSE, read_cache = FALSE, quiet = FALSE, discard_duplicates = FALSE, parallel = FALSE)
#> Info: preparing to read 22 data files...
#> Info: reading file '180522_Std_ETH-1_1.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-1_2.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-1_7.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-1_8.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-2_10.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-2_3.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-2_4.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-2_9.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-3_11.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-3_12.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-3_5.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-3_6.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-1_13.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-1_14.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-1_19.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-1_20.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-2_15.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-2_16.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-2_21.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-2_22.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Warning: caught error - cannot identify measured masses - block 'CResultDat...
#> Warning: caught error - cannot process vendor data table - block 'CResultDa...
#> Info: reading file '180523_Std_ETH-3_17.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-3_18.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: finished reading 22 files in 1.03 mins
#> Warning: Column `path` has different attributes on LHS and RHS of join
#> Info: encountered 24 problems in total
#> # A tibble: 24 x 4
#>    file_id        type  func                details
#>
#>  1 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  2 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  3 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  4 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  5 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  6 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  7 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  8 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  9 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#> 10 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#> # … with 14 more rows
iso_get_problems(cafs)
#> # A tibble: 24 x 4
#>    file_id        type  func                details
#>
#>  1 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  2 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  3 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  4 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  5 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  6 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  7 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  8 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  9 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#> 10 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#> # … with 14 more rows
iso_get_file_info(cafs)
#> Info: aggregating file info from 22 data file(s)
#> # A tibble: 22 x 7
#> file_id file_root file_path file_subpath file_datetime file_size
#>
#> 1 180522… /home/ja… 180522_S… 2018-05-22 17:24:52 651970
#> 2 180522… /home/ja… 180522_S… 2018-05-22 18:03:27 668650
#> 3 180522… /home/ja… 180522_S… 2018-05-22 21:14:55 668682
#> 4 180522… /home/ja… 180522_S… 2018-05-22 21:54:16 668678
#> 5 180522… /home/ja… 180522_S… 2018-05-22 23:11:42 669030
#> 6 180522… /home/ja… 180522_S… 2018-05-22 18:42:25 668992
#> 7 180522… /home/ja… 180522_S… 2018-05-22 19:21:29 669014
#> 8 180522… /home/ja… 180522_S… 2018-05-22 22:33:02 668970
#> 9 180522… /home/ja… 180522_S… 2018-05-22 23:47:26 652032
#> 10 180522… /home/ja… 180522_S… 2018-05-23 00:26:31 668710
#> # … with 12 more rows, and 1 more variable: MS_integration_time.s
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS: /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#>  [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] isoreader_1.1.4
#>
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4 Rcpp_1.0.4 pillar_1.4.3 compiler_3.6.3
#>  [5] highr_0.8 prettyunits_1.1.1 progress_1.2.2 R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2 base64enc_0.1-3 tools_3.6.3 digest_0.6.25
#> [13] rhdf5_2.30.1 lubridate_1.7.8 evaluate_0.14 lifecycle_0.2.0
#> [17] tibble_3.0.0 pkgconfig_2.0.3 rlang_0.4.5 openxlsx_4.1.4
#> [21] cli_2.0.2 yaml_2.2.1 parallel_3.6.3 xfun_0.12
#> [25] xml2_1.3.0 dplyr_0.8.5 stringr_1.4.0 knitr_1.28
#> [29] generics_0.0.2 vctrs_0.2.4 globals_0.12.5 hms_0.5.3
#> [33] tidyselect_1.0.0 glue_1.4.0 listenv_0.8.0 R6_2.4.1
#> [37] fansi_0.4.1 rmarkdown_2.1 tidyr_1.0.2 Rhdf5lib_1.8.0
#> [41] readr_1.3.1 purrr_0.3.3 magrittr_1.5 feather_0.3.5
#> [45] codetools_0.2-16 ellipsis_0.3.0 htmltools_0.4.0 assertthat_0.2.1
#> [49] future_1.16.0 UNF_2.0.6 utf8_1.1.4 stringi_1.4.6
#> [53] crayon_1.3.4 R.oo_1.23.0
```

Created on 2020-04-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)

Also: I edited all the posts above to use <details> tags so the long logs are collapsed by default.

japhir commented 4 years ago

Maybe it's because you ran it on the did files in https://github.com/isoverse/isoreader/issues/110#issuecomment-608047716 instead of the single caf file?

sebkopf commented 4 years ago

found it, it's tibble 3.0!!
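For context, tibble 3.0 made sub-assignment stricter about types. The sketch below shows one way an error of the `Assigned data ... must be compatible with existing data` kind can be triggered, by assigning a character value into an integer column. Whether this is the exact code path inside isoreader is not established in this thread, so treat it as an illustration under that assumption. The result depends on the installed tibble version, so no expected output is given.

```r
assign_result <- "tibble not installed"
if (requireNamespace("tibble", quietly = TRUE)) {
  tbl <- tibble::tibble(value = 1:3)
  assign_result <- tryCatch(
    {
      tbl[1, "value"] <- "a"  # character value into an integer column
      "assignment succeeded"
    },
    error = function(e) conditionMessage(e)
  )
}
print(assign_result)
```

On a strict tibble version the printed text is the error message; on a version that coerces the column, it is "assignment succeeded".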

sebkopf commented 4 years ago

Hi @japhir, can you check whether devtools::install_github("isoverse/isoreader", ref = "dev") solves the problem?

japhir commented 4 years ago

That's great! Thanks for implementing a fix so soon. I've updated to the dev version for now, and it appears to be working. It no longer gives me the LHS and RHS warnings when I read in the files, but the old summary files and cached files are still very slow. It looks like I'm going to have to re-import all (caf) files with read_cache = FALSE, because applying iso_get_file_info() to the cached versions still results in an error. That'll take a while, so I'll get back to you on that later.

re-import of one folder with caf files

``` r
library(isoreader)
#>
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#>
#>     filter
dir.create("/tmp/rtmp")
setwd("/tmp/rtmp")
cafs <- iso_read_dual_inlet("~/Documents/archive/pacman/cafs/180522_Stds/", cache = FALSE, read_cache = FALSE, quiet = FALSE, discard_duplicates = FALSE, parallel = FALSE)
#> Info: preparing to read 22 data files...
#> Info: reading file '180522_Std_ETH-1_1.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-1_2.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-1_7.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-1_8.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-2_10.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-2_3.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-2_4.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-2_9.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-3_11.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-3_12.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-3_5.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-3_6.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-1_13.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-1_14.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-1_19.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-1_20.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-2_15.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-2_16.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-2_21.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-2_22.caf' with '.caf' reader
#> Warning: caught error - cannot identify measured masses - block 'CResultDat...
#> Warning: caught error - cannot process vendor data table - block 'CResultDa...
#> Info: reading file '180523_Std_ETH-3_17.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-3_18.caf' with '.caf' reader
#> Info: finished reading 22 files in 57.35 secs
#> Info: encountered 2 problems in total
#> # A tibble: 2 x 4
#>   file_id         type  func             details
#>
#> 1 180523_Std_ETH… error extract_caf_raw… cannot identify measured masses - bloc…
#> 2 180523_Std_ETH… error extract_caf_ven… cannot process vendor data table - blo…
iso_get_problems(cafs)
#> # A tibble: 2 x 4
#>   file_id         type  func             details
#>
#> 1 180523_Std_ETH… error extract_caf_raw… cannot identify measured masses - bloc…
#> 2 180523_Std_ETH… error extract_caf_ven… cannot process vendor data table - blo…
iso_get_file_info(cafs)
#> Info: aggregating file info from 22 data file(s)
#> # A tibble: 22 x 22
#> file_id file_root file_path file_subpath file_datetime file_size Line
#>
#> 1 180522… /home/ja… 180522_S… 2018-05-22 17:24:52 651970 1
#> 2 180522… /home/ja… 180522_S… 2018-05-22 18:03:27 668650 2
#> 3 180522… /home/ja… 180522_S… 2018-05-22 21:14:55 668682 1
#> 4 180522… /home/ja… 180522_S… 2018-05-22 21:54:16 668678 2
#> 5 180522… /home/ja… 180522_S… 2018-05-22 23:11:42 669030 2
#> 6 180522… /home/ja… 180522_S… 2018-05-22 18:42:25 668992 1
#> 7 180522… /home/ja… 180522_S… 2018-05-22 19:21:29 669014 2
#> 8 180522… /home/ja… 180522_S… 2018-05-22 22:33:02 668970 1
#> 9 180522… /home/ja… 180522_S… 2018-05-22 23:47:26 652032 1
#> 10 180522… /home/ja… 180522_S… 2018-05-23 00:26:31 668710 2
#> # … with 12 more rows, and 15 more variables: `Peak Center`,
#> #   Pressadjust, Background, `Reference Refill`, `Weight
#> #   [mg]`, Sample, `Identifier 1`, `Identifier 2`,
#> #   Analysis, Comment, Preparation, `Pre Script`, `Post
#> #   Script`, Method, MS_integration_time.s
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS: /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#>  [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] isoreader_1.1.5
#>
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4 Rcpp_1.0.4 pillar_1.4.3 compiler_3.6.3
#>  [5] highr_0.8 prettyunits_1.1.1 progress_1.2.2 R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2 base64enc_0.1-3 tools_3.6.3 digest_0.6.25
#> [13] rhdf5_2.30.1 lubridate_1.7.8 evaluate_0.14 lifecycle_0.2.0
#> [17] tibble_3.0.0 pkgconfig_2.0.3 rlang_0.4.5 openxlsx_4.1.4
#> [21] cli_2.0.2 yaml_2.2.1 parallel_3.6.3 xfun_0.12
#> [25] xml2_1.3.0 dplyr_0.8.5 stringr_1.4.0 knitr_1.28
#> [29] generics_0.0.2 vctrs_0.2.4 globals_0.12.5 hms_0.5.3
#> [33] tidyselect_1.0.0 glue_1.4.0 listenv_0.8.0 R6_2.4.1
#> [37] fansi_0.4.1 rmarkdown_2.1 tidyr_1.0.2 Rhdf5lib_1.8.0
#> [41] readr_1.3.1 purrr_0.3.3 magrittr_1.5 feather_0.3.5
#> [45] codetools_0.2-16 ellipsis_0.3.0 htmltools_0.4.0 assertthat_0.2.1
#> [49] future_1.16.0 UNF_2.0.6 utf8_1.1.4 stringi_1.4.6
#> [53] crayon_1.3.4 R.oo_1.23.0
```

Created on 2020-04-08 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)

japhir commented 4 years ago

It has just finished re-reading the 4928 caf files! It now reports 1376 problems, many of which come from duplicate files.

I got the warnings below when I saved the aggregated files to rds with iso_save():

Warning messages:
1: In max(.data$pos) : no non-missing arguments to max; returning -Inf
2: In max(.data$pos) : no non-missing arguments to max; returning -Inf
3: Unknown or uninitialised column: `block`.
4: Unknown or uninitialised column: `block`.
5: Unknown or uninitialised column: `block`.

Reading in the newly created summary rds file is still slow (20.43 secs! vs read_rds which is basically instantaneous).

iso_get_file_info() unfortunately still fails on the caf files :cry:.
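
One way to track down the offending files without relying on the error message alone would be to check the class of `file_datetime` for each file. This is a minimal base R sketch on a hypothetical stand-in list (`fake_files`, not a real isoreader object); it assumes each file keeps its metadata in `$file_info$file_datetime`, which is an assumption about the internal structure and not documented API:

```r
# hypothetical stand-in for a collection of read files (illustration only)
fake_files <- list(
  "good.caf" = list(file_info = list(
    file_datetime = as.POSIXct("2017-01-26 12:00:00", tz = "Europe/Amsterdam")
  )),
  "bad.caf" = list(file_info = list(
    file_datetime = NA_integer_  # what a failed datetime extraction might leave behind
  ))
)

# first class of file_datetime for every file in the collection
datetime_class <- vapply(
  fake_files,
  function(f) class(f$file_info$file_datetime)[1],
  character(1)
)

# names of the files whose file_datetime is not a proper datetime
bad_files <- names(datetime_class)[datetime_class != "POSIXct"]
print(bad_files)
```

If the structural assumption holds, the same `vapply()` call could presumably be pointed at the real collection instead of `fake_files`.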

logs of importing the summarized caf di file

``` r
library(isoreader)
#>
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#>
#>     filter
setwd("~/SurfDrive/PhD/programming/dataprocessing")
cafs <- iso_read_dual_inlet("out/cafs.di.rds")
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file 'out/cafs.di.rds' with '.di.rds' reader
#> Info: loaded data for 4928 data files from R Data Storage - checking loaded...
#> Info: finished reading 1 files in 19.39 secs
#> Info: encountered 1376 problems in total
#> # A tibble: 1,376 x 4
#>    file_id                type  func           details
#>
#>  1 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  2 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  3 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  4 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  5 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  6 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  7 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  8 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  9 170127_170127_170124_… error extract_caf_r… cannot identify measured masses …
#> 10 170127_170127_170124_… error extract_caf_v… cannot process vendor data table…
#> # … with 1,366 more rows
iso_get_problems(cafs)
#> # A tibble: 1,376 x 4
#>    file_id                type  func           details
#>
#>  1 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  2 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  3 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  4 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  5 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  6 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  7 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  8 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  9 170127_170127_170124_… error extract_caf_r… cannot identify measured masses …
#> 10 170127_170127_170124_… error extract_caf_v… cannot process vendor data table…
#> # … with 1,366 more rows
iso_get_file_info(cafs)
#> Info: aggregating file info from 4928 data file(s)
#> Error: No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` > and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` .
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS:   /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] isoreader_1.1.5
#>
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4         Rcpp_1.0.4        pillar_1.4.3      compiler_3.6.3
#>  [5] highr_0.8         prettyunits_1.1.1 progress_1.2.2    R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2     base64enc_0.1-3   tools_3.6.3       digest_0.6.25
#> [13] rhdf5_2.30.1      lubridate_1.7.8   evaluate_0.14     lifecycle_0.2.0
#> [17] tibble_3.0.0      pkgconfig_2.0.3   rlang_0.4.5       openxlsx_4.1.4
#> [21] cli_2.0.2         yaml_2.2.1        parallel_3.6.3    xfun_0.12
#> [25] xml2_1.3.0        dplyr_0.8.5       stringr_1.4.0     knitr_1.28
#> [29] generics_0.0.2    vctrs_0.2.4       globals_0.12.5    hms_0.5.3
#> [33] tidyselect_1.0.0  glue_1.4.0        listenv_0.8.0     R6_2.4.1
#> [37] fansi_0.4.1       rmarkdown_2.1     tidyr_1.0.2       Rhdf5lib_1.8.0
#> [41] readr_1.3.1       purrr_0.3.3       magrittr_1.5      feather_0.3.5
#> [45] codetools_0.2-16  ellipsis_0.3.0    htmltools_0.4.0   assertthat_0.2.1
#> [49] future_1.16.0     UNF_2.0.6         utf8_1.1.4        stringi_1.4.6
#> [53] crayon_1.3.4      R.oo_1.23.0
```

Created on 2020-04-08 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)
japhir commented 4 years ago

Of course I should have just limited it to the two files that are actually named in the error message. That would have saved me 2 hours of unnecessary computation ;-). Anyway, here they are: problematic_2_files.zip

included reprex ran on only those two files

``` r
library(isoreader)
#>
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#>
#>     filter
setwd("~/Downloads")
cafs <- iso_read_dual_inlet("problematic_2_files",
                            cache = FALSE, read_cache = FALSE, quiet = FALSE,
                            discard_duplicates = FALSE, parallel = FALSE)
#> Info: preparing to read 2 data files...
#> Info: reading file 'problematic_2_files/170126_170124_Sibren_run29-1426.caf...
#> Info: reading file 'problematic_2_files/170621_170522_Guido_Magda_ETH-1-000...
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Warning: Unknown or uninitialised column: `block`.
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Info: finished reading 2 files in 3.38 secs
#> Info: encountered 7 problems in total
#> # A tibble: 7 x 4
#>   file_id                       type  func                     details
#>
#> 1 170621_170522_Guido_Magda_ET… error extract_isodat_old_seque… no C_blocks ava…
#> 2 170621_170522_Guido_Magda_ET… error extract_isodat_datetime   no C_blocks ava…
#> 3 170621_170522_Guido_Magda_ET… error extract_MS_integration_t… no C_blocks ava…
#> 4 170621_170522_Guido_Magda_ET… error extract_caf_raw_voltage_… no C_blocks ava…
#> 5 170621_170522_Guido_Magda_ET… error extract_isodat_reference… no C_blocks ava…
#> 6 170621_170522_Guido_Magda_ET… error extract_isodat_resistors  no C_blocks ava…
#> 7 170621_170522_Guido_Magda_ET… error extract_caf_vendor_data_… no C_blocks ava…
iso_get_problems(cafs)
#> # A tibble: 7 x 4
#>   file_id                       type  func                     details
#>
#> 1 170621_170522_Guido_Magda_ET… error extract_isodat_old_seque… no C_blocks ava…
#> 2 170621_170522_Guido_Magda_ET… error extract_isodat_datetime   no C_blocks ava…
#> 3 170621_170522_Guido_Magda_ET… error extract_MS_integration_t… no C_blocks ava…
#> 4 170621_170522_Guido_Magda_ET… error extract_caf_raw_voltage_… no C_blocks ava…
#> 5 170621_170522_Guido_Magda_ET… error extract_isodat_reference… no C_blocks ava…
#> 6 170621_170522_Guido_Magda_ET… error extract_isodat_resistors  no C_blocks ava…
#> 7 170621_170522_Guido_Magda_ET… error extract_caf_vendor_data_… no C_blocks ava…
iso_get_file_info(cafs)
#> Info: aggregating file info from 2 data file(s)
#> Error: No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` > and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` .
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS:   /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] isoreader_1.1.5
#>
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4         Rcpp_1.0.4        pillar_1.4.3      compiler_3.6.3
#>  [5] highr_0.8         prettyunits_1.1.1 progress_1.2.2    R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2     base64enc_0.1-3   tools_3.6.3       digest_0.6.25
#> [13] rhdf5_2.30.1      lubridate_1.7.8   evaluate_0.14     lifecycle_0.2.0
#> [17] tibble_3.0.0      pkgconfig_2.0.3   rlang_0.4.5       openxlsx_4.1.4
#> [21] cli_2.0.2         yaml_2.2.1        parallel_3.6.3    xfun_0.12
#> [25] xml2_1.3.0        dplyr_0.8.5       stringr_1.4.0     knitr_1.28
#> [29] generics_0.0.2    vctrs_0.2.4       globals_0.12.5    hms_0.5.3
#> [33] tidyselect_1.0.0  glue_1.4.0        listenv_0.8.0     R6_2.4.1
#> [37] fansi_0.4.1       rmarkdown_2.1     tidyr_1.0.2       Rhdf5lib_1.8.0
#> [41] readr_1.3.1       purrr_0.3.3       magrittr_1.5      feather_0.3.5
#> [45] codetools_0.2-16  ellipsis_0.3.0    htmltools_0.4.0   assertthat_0.2.1
#> [49] future_1.16.0     UNF_2.0.6         utf8_1.1.4        stringi_1.4.6
#> [53] crayon_1.3.4      R.oo_1.23.0
```

Created on 2020-04-08 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)
japhir commented 4 years ago

This seems resolved now, also in the master branch! Read/save speeds of the raw files are back to before and I don't get errors on saving the rds! Running iso_read_dual_inlet() on the saved rds file is still slower than a plain read_rds(), however.
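
To put numbers on that kind of comparison, something like the following could be used. It is only a rough base R sketch: `time_call()` is a hypothetical helper, and the temporary file with random data frames is a stand-in for the real collection.

```r
# hypothetical stand-in data: a list of small data frames saved as one .rds file
tmp <- tempfile(fileext = ".rds")
fake_collection <- replicate(
  200, data.frame(x = runif(100), y = runif(100)), simplify = FALSE
)
saveRDS(fake_collection, tmp)

# hypothetical helper: median elapsed seconds over n runs of a function
time_call <- function(fun, n = 3) {
  times <- vapply(
    seq_len(n),
    function(i) system.time(fun())[["elapsed"]],
    numeric(1)
  )
  median(times)
}

plain_read <- time_call(function() readRDS(tmp))
print(plain_read)

# with isoreader installed, the same helper could wrap the package reader, e.g.
# time_call(function() isoreader::iso_read_dual_inlet("out/cafs.di.rds"))

unlink(tmp)  # clean up the temporary file
```

Taking the median over a few runs smooths out disk caching effects a little, but the numbers will still vary between machines.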

sebkopf commented 4 years ago

Hi @japhir. The whole cache file system was revamped in yesterday's release (1.2.0), so cache files can now be copied, have more useful file names to tell which is which, and allow skipping the data integrity checks for files that are up to date (which should make reading .rds files about as fast as a direct readRDS). You do need to re-generate your cache, but it's easy now with iso_reread_outdated_files(iso_files), and hopefully this is the last time for a long time that we need to make structural changes like this. Would love to hear if it works.

By the way, notifications about isoverse are now in a repo for this purpose, take a look: https://github.com/isoverse/news/issues/2

japhir commented 4 years ago

Hi @sebkopf, thanks for the notice. I've just updated to R 4.0 and the newest isoreader, but I think something must have gone wrong somewhere… re-reading the whole database took about twice as long as last time (with very few new files, as you can imagine) and while iso_get_file_info() works, it's also much slower than before. iso_get_raw_data() hasn't finished as I'm typing this...

How can I help debug this?

sebkopf commented 4 years ago

Hi @japhir, can you send a small excerpt of your entire collection? Nothing has changed in iso_get_file_info() but I have not yet tested any of these in R 4.0. I think the new R does a lot more type cast checks that might make those built into iso_get_file_info() redundant (but also make it slower in R 4.0 since they're essentially done twice). The changes with the caching should just make reading cached files and .rds files faster, not inherently the data collection. The speed of iso_get_raw_data() is mostly limited by how quickly tidyverse functions work (unless you also bring in all the file info) and with the switch to vctrs some things have gotten slower :( Not sure yet whether the benefit of type cast checks in vctrs really outweighs the speed losses.

japhir commented 4 years ago

Just finished reading in everything (again, much slower than before). I didn't get any particular warnings on the newer did files, but got these on the caf files:

Info: exporting data from 4928 iso_files into R Data Storage '/home/japhir/SurfDrive/PhD/programming/dataprocessing/out/cafs.di.rds'
Warning messages:
1: In max(.data$pos) : no non-missing arguments to max; returning -Inf
2: In max(.data$pos) : no non-missing arguments to max; returning -Inf
3: Unknown or uninitialised column: `block`.
4: Unknown or uninitialised column: `block`.
5: Unknown or uninitialised column: `block`.

All of the previously shared files in this thread should be good, the raw data haven't changed. How big of a subset were you thinking? I was hesitant to share many earlier, but I just asked my supervisor and he says it shouldn't be a problem to share some files.

sebkopf commented 4 years ago

that's great! I was actually thinking not the raw files since they don't cause trouble for me but just parts of the isofile collection, so something like this:

iso_files <- iso_read_dual_inlet("....rds")
# pick 100 random files from the collection
iso_files[sample(1:length(iso_files), 100)] %>% iso_save("for_seb.di.rds")

as for that .data$pos warning, could you see if you can pinpoint where it occurs with the following flags to elevate warnings to errors and not catch them?

options(warn = 2)
isoreader:::iso_turn_debug_on(catch_errors = FALSE)
iso_read_dual_inlet(....)
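
For context, the `max(.data$pos)` message is what base R emits when `max()` receives an empty vector, and `options(warn = 2)` turns that warning into an error so it shows up in a traceback. A minimal base R sketch of the mechanism (`empty_pos` is just an illustrative stand-in, not the package's actual data):

```r
# stand-in for a position column that ended up with no values
empty_pos <- numeric(0)

# default behaviour: a warning is raised and -Inf is returned
default_result <- suppressWarnings(max(empty_pos))

# with warn = 2 the same call raises an error instead of a warning
old_opts <- options(warn = 2)
caught <- tryCatch(max(empty_pos), error = function(e) e)
options(old_opts)  # restore the previous warning level

print(default_result)
print(class(caught))
```

Restoring the old options afterwards keeps the stricter warning level from leaking into the rest of the session.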
japhir commented 4 years ago

Ok @sebkopf, here's the test file with 100 standards!

for_seb.di.zip

I generated them like this:

  seb_sub <- dids %>%
    iso_filter_files(Comment == "STD")  # for standard
  # evenly spaced throughout the record, not sure if it's sorted by file_datetime though, 
  # so could still be random.
  seb_sub <- seb_sub[(floor(seq(1, length(seb_sub), length.out = 100)))] %>%  
    iso_save("out/for_seb.di.rds")

I tried to have a look at where it's getting slow with profvis, but I don't really understand the graph so I'll leave that up to you ;-)

  library(profvis)
  library(isoreader)
  profvis({
    dids <- iso_read_dual_inlet("out/for_seb.di.rds")

    didinfo <- dids %>%
      iso_get_file_info()

    rawdata <- dids %>%
      iso_get_raw_data()
  })
output on my machine

```r
Attaching package: ‘isoreader’

The following object is masked from ‘package:stats’:

    filter

Progress: [-----------------------------------------------------------------------------------------] 0/1 ( 0%) 0s
Info: preparing to read 1 data files (all will be cached)...
Progress: [-----------------------------------------------------------------------------------------] 0/1 ( 0%) 0s
Info: reading file 'out/for_seb.di.rds' with '.di.rds' reader...
Progress: [-----------------------------------------------------------------------------------------] 0/1 ( 0%) 0s
Info: loaded 100 data files from R Data Storage
Progress: [-----------------------------------------------------------------------------------------] 0/1 ( 0%) 0s
Progress: [=========================================================================================] 1/1 (100%) 0s
Info: finished reading 1 files in 0.19 secs
Info: encountered 19 problems in total
# A tibble: 19 x 4
   file_id                type    func               details

 1 180223_1_Kiel Std tes… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
 2 180223_1_Kiel Std tes… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
 3 180517_29_RobinV_5_ET… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180517_29_RobinV…
 4 180621_47_Chris_14_ET… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
 5 180621_47_Chris_14_ET… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
 6 180621_47_Chris_14_ET… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180621_47_Chris_…
 7 180903_83_Cas_19_ETH-… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180903_83_Cas_19…
 8 180915_88_WuyunCas_39… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180915_88_WuyunC…
 9 180929_94_Ilja_37_ETH… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180929_94_Ilja_3…
10 190514_195_NdW_25_ETH… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 190514_195_NdW_2…
11 190805_237_RvdP_5_ETH… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
12 190805_237_RvdP_5_ETH… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
13 191125_295_MM_16_ETH-… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
14 191125_295_MM_16_ETH-… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
15 200110_311_NdW_43_ETH… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 200110_311_NdW_4…
16 180316_4_Std test_6_E… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180316_4_Std tes…
17 180831_83_Cas_9_ETH-3… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
18 180831_83_Cas_9_ETH-3… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
19 180831_83_Cas_9_ETH-3… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180831_83_Cas_9_…
Info: aggregating file info from 100 data file(s)
Info: aggregating raw data from 100 data file(s)
```
japhir commented 4 years ago

regarding the debugging request: this doesn't work because of the duplicated files

  options(warn = 2)
  isoreader:::iso_turn_debug_on(catch_errors = FALSE)
  setwd("~/Documents/archive/")
  isoreader::iso_read_dual_inlet("~/Documents/archive/pacman/cafs", 
                                 discard_duplicates = FALSE)
output

``` r
Info: debug mode turned on, error catching turned off, caching turned off
Error: (converted from warning) some files from different folders have identical file names:
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(1).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(2).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(3).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(4).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(5).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(6).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(7).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(8).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(9).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/deel2/170402_Sibren_8(1).caf
~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/deel2/170402_Sibren_8(2).caf
~/Documents/a
```