Open pratarora opened 3 months ago
Thanks for adding this issue. I'm currently working on a solution to import/process large json files in a chunk-wise manner.
Hi @pratarora , I've added a new parameter n
to compound_tbl_lipidblast
that you can use to define the number of lines of a (large) json file that should be loaded and processed at a time. This should help avoiding the memory problems you reported.
Thus, the code below (with n = 100000
) should work on the full file (it worked on my computer).
fl <-("MoNA-export-LipidBlast_2022.json")
cmps <- compound_tbl_lipidblast(fl, n = 100000, verbose = TRUE)
For now, you would need to install this version of the package from github using BiocManager::install("RforMassSpectrometry/CompoundDb")
(assuming you have the BiocManager package installed - which you can with install.packages("BiocManager")
.
Hi Johannes
Thanks a lot for the quick fix.
I tried it, but it works partially. The exactmass column is converted to character while reading. Inchi and inchikey is not read, I can change exactmass manually to numeric and get the sqlite file.
cmps %>% str
tibble [1,346,798 × 7] (S3: tbl_df/tbl/data.frame)
$ compound_id: chr [1:1346798] "LipidBlast2022_999999" "LipidBlast2022_999998" "LipidBlast2022_999997" "LipidBlast2022_999996" ...
$ name : chr [1:1346798] "HexCer 27:3;3O/15:0;(2OH)" "HexCer 27:3;3O/14:1;(2OH)" "HexCer 27:3;3O/14:0;(2OH)" "HexCer 27:3;3O/13:1;(2OH)" ...
$ inchi : chr [1:1346798] "" "" "" "" ...
$ inchikey : chr [1:1346798] NA NA NA NA ...
$ formula : chr [1:1346798] "C48H89NO10" "C47H85NO10" "C47H87NO10" "C46H83NO10" ...
$ exactmass : chr [1:1346798] "839.648648048" "823.6173479199999" "825.632997984" "809.601697856" ...
$ synonyms :List of 1346798
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
..$ : chr NA
.. [list output truncated]
oooooh, thanks for the report! I will check that. Should be numeric of course!
I am working on it, I will post the code for it soon.
should be fixed now. thanks for reporting!
Hi Johannes,
I modified the code. I tried to make it a bit parallel as well. It can now read all the fields for both the old lipidblast (https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/70d4b8c0-4863-4632-ade4-0ace697bdaa1) and the 2022 one (https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/d0899a6d-15b7-46ad-a836-f07313465fb8). I will create a pull request if it seems ok for you.
.parse_lipidblast_json_element <- function(x) {
id <- x$id
cmp <- x$compound[[1]]
# Helper function to extract metadata from multiple sources
get_metadata <- function(name, sources) {
for (source in sources) {
value <- unlist(lapply(source, function(z) {
if (z$name == name) z$value
}))
if (length(value) > 0 && !all(value == "")) {
return(value[1])
}
}
return(NA_character_)
}
# Define metadata sources
metadata_sources <- list(x$metaData, cmp$metaData)
# Get names
nms <- vapply(cmp$names, `[[`, "name", FUN.VALUE = "character")
snms <- if (length(nms) > 1L) nms[-1L] else NA_character_
# Extract metadata
mass <- get_metadata("total exact mass", metadata_sources)
mass <- as.numeric(mass)
if (is.na(mass)) mass <- NA_real_
frml <- get_metadata("molecular formula", metadata_sources)
inchi <- get_metadata("InChI", metadata_sources)
inchikey <- get_metadata("InChIKey", metadata_sources)
smiles <- get_metadata("SMILES", metadata_sources)
compound_class <- get_metadata("compound class", metadata_sources)
ionization_mode <- get_metadata("ionization mode", metadata_sources)
precursor_mz <- as.numeric(get_metadata("precursor m/z", metadata_sources))
precursor_type <- get_metadata("precursor type", metadata_sources)
retention_time <- as.numeric(get_metadata("retention time", metadata_sources))
ccs <- as.numeric(get_metadata("ccs", metadata_sources))
# Extract spectrum data
spectrum <- x$spectrum
if (is.null(spectrum)) spectrum <- NA_character_
list(
compound_id = id,
name = nms[1],
inchi = inchi,
inchikey = inchikey,
smiles = smiles,
formula = frml,
exactmass = mass,
compound_class = compound_class,
ionization_mode = ionization_mode,
precursor_mz = precursor_mz,
precursor_type = precursor_type,
retention_time = retention_time,
ccs = ccs,
spectrum = spectrum,
synonyms = list(snms)
)
}
.import_lipidblast <- function(file, n = -1L, verbose = FALSE, BPPARAM = bpparam()) {
if (n < 0) {
lipidb <- jsonlite::read_json(file)
if (verbose)
message("Processing ", length(lipidb), " elements ...", appendLF = FALSE)
# Use BiocParallel to process elements in parallel
res <- bplapply(lipidb, .parse_lipidblast_json_element, BPPARAM = BPPARAM)
if (verbose) message(" done.")
} else {
res <- .import_lipidblast_json_chunk(file, n = n, verbose = verbose, BPPARAM = BPPARAM)
}
dplyr::bind_rows(res)
}
.import_lipidblast_json_chunk <- function(x, n = 10000, verbose = FALSE, BPPARAM = bpparam()) {
con <- file(x, open = "r")
on.exit(close(con))
res <- list()
while (length(ls <- readLines(con, n = n, warn = FALSE))) {
if (length(grep("^\\[", ls[1L])))
ls <- ls[-1L]
if (length(grep("^\\]", ls[length(ls)])))
ls <- ls[-length(ls)]
ls <- sub(",$", "", ls)
if (length(ls)) {
chunk_res <- bplapply(ls, function(z) {
.parse_lipidblast_json_element(jsonlite::fromJSON(z, simplifyVector = FALSE))
}, BPPARAM = BPPARAM)
res <- c(res, chunk_res)
}
if (verbose)
message("Processed ", length(ls), " elements")
}
res
}
compound_tbl_lipidblast <- function(file, collapse = character(), n = -1L,
verbose = FALSE, BPPARAM = bpparam()) {
.check_parameter_file(file)
res <- .import_lipidblast(file, n = n, verbose = verbose, BPPARAM = BPPARAM)
if (length(collapse)) {
## collapse elements from lists.
res$synonyms <- vapply(res$synonyms, paste0, collapse = collapse[1L],
FUN.VALUE = "character")
}
res
}
new lipidblast
old lipidblast
Hi Johannes, I created the pull request: https://github.com/rformassspectrometry/CompoundDb/pull/117
Hi
Thanks for such a great package!
I am trying to make a compounddb object from lipidblast-2022 data (https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/d0899a6d-15b7-46ad-a836-f07313465fb8). But it keeps on failing, with an empty data frame as an output. I first thought it was memory issue, so I tried with bigger memory(256GB, it doesn't work).
I also tried to split the json to smaller chunks, in this case it can read the file but it gives an empty data frame. I have checked the structure of the json file it seems to be fine as the original.
any idea what i might be missing? Any place i could directly find the sqlite file?
my session info is :
R version 4.3.2 (2023-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.3 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC tzcode source: system (glibc)
attached base packages: [1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages: [1] jsonlite_1.8.8 mzR_2.36.0 Rcpp_1.0.12
[4] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[7] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[10] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0 CompoundDb_1.6.0 AnnotationFilter_1.26.0 [16] MetaboCoreUtils_1.10.0 MetaboAnnotation_1.6.1 Spectra_1.12.0
[19] ProtGenerics_1.34.0 BiocParallel_1.36.0 S4Vectors_0.40.2
[22] BiocGenerics_0.48.1
loaded via a namespace (and not attached): [1] MultiAssayExperiment_1.28.0 magrittr_2.0.3
[3] fs_1.6.4 zlibbioc_1.48.2
[5] vctrs_0.6.5 memoise_2.0.1
[7] RCurl_1.98-1.14 base64enc_0.1-3
[9] htmltools_0.5.8.1 S4Arrays_1.2.1
[11] AnnotationHub_3.10.0 curl_5.2.1
[13] SparseArray_1.2.4 htmlwidgets_1.6.4
[15] cachem_1.0.8 uuid_1.2-0
[17] igraph_2.0.3 mime_0.12
[19] lifecycle_1.0.4 pkgconfig_2.0.3
[21] Matrix_1.6-5 R6_2.5.1
[23] fastmap_1.1.1 GenomeInfoDbData_1.2.11
[25] MatrixGenerics_1.14.0 shiny_1.8.0
[27] clue_0.3-65 digest_0.6.35
[29] rsvg_2.6.0 colorspace_2.1-0
[31] AnnotationDbi_1.64.1 GenomicRanges_1.54.1
[33] RSQLite_2.3.6 filelock_1.0.3
[35] fansi_1.0.6 timechange_0.3.0
[37] httr_1.4.7 abind_1.4-5
[39] compiler_4.3.2 bit64_4.0.5
[41] withr_3.0.0 DBI_1.2.2
[43] MASS_7.3-60 ChemmineR_3.54.0
[45] rappdirs_0.3.3 DelayedArray_0.28.0
[47] rjson_0.2.21 tools_4.3.2
[49] interactiveDisplayBase_1.40.0 httpuv_1.6.14
[51] glue_1.7.0 QFeatures_1.12.0
[53] promises_1.2.1 grid_4.3.2
[55] pbdZMQ_0.3-11 cluster_2.1.4
[57] generics_0.1.3 gtable_0.3.5
[59] tzdb_0.4.0 hms_1.1.3
[61] xml2_1.3.6 utf8_1.2.4
[63] XVector_0.42.0 BiocVersion_3.18.1
[65] pillar_1.9.0 IRdisplay_1.1
[67] later_1.3.2 BiocFileCache_2.10.1
[69] lattice_0.22-5 bit_4.0.5
[71] tidyselect_1.2.1 Biostrings_2.70.3
[73] gridExtra_2.3 IRanges_2.36.0
[75] SummarizedExperiment_1.32.0 Biobase_2.62.0
[77] matrixStats_1.3.0 DT_0.32
[79] stringi_1.8.4 lazyeval_0.2.2
[81] yaml_2.3.8 evaluate_0.23
[83] codetools_0.2-19 MsCoreUtils_1.14.1
[85] BiocManager_1.30.23 cli_3.6.2
[87] IRkernel_1.3.2 xtable_1.8-4
[89] repr_1.1.7 munsell_0.5.1
[91] GenomeInfoDb_1.38.8 dbplyr_2.5.0
[93] png_0.1-8 parallel_4.3.2
[95] ellipsis_0.3.2 blob_1.2.4
[97] bitops_1.0-7 scales_1.3.0
[99] ncdf4_1.22 crayon_1.5.2
[101] rlang_1.1.3 KEGGREST_1.42.0
I split the json file in python so that i can avoid loading on memory. The code i used to split the json: