rformassspectrometry / CompoundDb

Creating and using (chemical) compound databases
https://rformassspectrometry.github.io/CompoundDb/index.html
16 stars 16 forks source link

Unable to read lipidblast-2022 data #114

Open pratarora opened 3 months ago

pratarora commented 3 months ago

Hi

Thanks for such a great package!

I am trying to make a compounddb object from lipidblast-2022 data (https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/d0899a6d-15b7-46ad-a836-f07313465fb8). But it keeps on failing, with an empty data frame as an output. I first thought it was memory issue, so I tried with bigger memory(256GB, it doesn't work).

I also tried to split the json to smaller chunks, in this case it can read the file but it gives an empty data frame. I have checked the structure of the json file it seems to be fine as the original.

any idea what i might be missing? Any place i could directly find the sqlite file?

fl <-("MoNA-export-LipidBlast_2022-split_100k/MoNA-export-LipidBlast_2022-split_1.json")
cmps <- compound_tbl_lipidblast(fl)
cmps

my session info is :

R version 4.3.2 (2023-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.3 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC tzcode source: system (glibc)

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages: [1] jsonlite_1.8.8 mzR_2.36.0 Rcpp_1.0.12
[4] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[7] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[10] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0 CompoundDb_1.6.0 AnnotationFilter_1.26.0 [16] MetaboCoreUtils_1.10.0 MetaboAnnotation_1.6.1 Spectra_1.12.0
[19] ProtGenerics_1.34.0 BiocParallel_1.36.0 S4Vectors_0.40.2
[22] BiocGenerics_0.48.1

loaded via a namespace (and not attached): [1] MultiAssayExperiment_1.28.0 magrittr_2.0.3
[3] fs_1.6.4 zlibbioc_1.48.2
[5] vctrs_0.6.5 memoise_2.0.1
[7] RCurl_1.98-1.14 base64enc_0.1-3
[9] htmltools_0.5.8.1 S4Arrays_1.2.1
[11] AnnotationHub_3.10.0 curl_5.2.1
[13] SparseArray_1.2.4 htmlwidgets_1.6.4
[15] cachem_1.0.8 uuid_1.2-0
[17] igraph_2.0.3 mime_0.12
[19] lifecycle_1.0.4 pkgconfig_2.0.3
[21] Matrix_1.6-5 R6_2.5.1
[23] fastmap_1.1.1 GenomeInfoDbData_1.2.11
[25] MatrixGenerics_1.14.0 shiny_1.8.0
[27] clue_0.3-65 digest_0.6.35
[29] rsvg_2.6.0 colorspace_2.1-0
[31] AnnotationDbi_1.64.1 GenomicRanges_1.54.1
[33] RSQLite_2.3.6 filelock_1.0.3
[35] fansi_1.0.6 timechange_0.3.0
[37] httr_1.4.7 abind_1.4-5
[39] compiler_4.3.2 bit64_4.0.5
[41] withr_3.0.0 DBI_1.2.2
[43] MASS_7.3-60 ChemmineR_3.54.0
[45] rappdirs_0.3.3 DelayedArray_0.28.0
[47] rjson_0.2.21 tools_4.3.2
[49] interactiveDisplayBase_1.40.0 httpuv_1.6.14
[51] glue_1.7.0 QFeatures_1.12.0
[53] promises_1.2.1 grid_4.3.2
[55] pbdZMQ_0.3-11 cluster_2.1.4
[57] generics_0.1.3 gtable_0.3.5
[59] tzdb_0.4.0 hms_1.1.3
[61] xml2_1.3.6 utf8_1.2.4
[63] XVector_0.42.0 BiocVersion_3.18.1
[65] pillar_1.9.0 IRdisplay_1.1
[67] later_1.3.2 BiocFileCache_2.10.1
[69] lattice_0.22-5 bit_4.0.5
[71] tidyselect_1.2.1 Biostrings_2.70.3
[73] gridExtra_2.3 IRanges_2.36.0
[75] SummarizedExperiment_1.32.0 Biobase_2.62.0
[77] matrixStats_1.3.0 DT_0.32
[79] stringi_1.8.4 lazyeval_0.2.2
[81] yaml_2.3.8 evaluate_0.23
[83] codetools_0.2-19 MsCoreUtils_1.14.1
[85] BiocManager_1.30.23 cli_3.6.2
[87] IRkernel_1.3.2 xtable_1.8-4
[89] repr_1.1.7 munsell_0.5.1
[91] GenomeInfoDb_1.38.8 dbplyr_2.5.0
[93] png_0.1-8 parallel_4.3.2
[95] ellipsis_0.3.2 blob_1.2.4
[97] bitops_1.0-7 scales_1.3.0
[99] ncdf4_1.22 crayon_1.5.2
[101] rlang_1.1.3 KEGGREST_1.42.0

I split the json file in python so that i can avoid loading on memory. The code i used to split the json:

import ijson
import json
from decimal import Decimal

def decimal_default(obj):
    if isinstance(obj, Decimal):
        return float(obj)
    raise TypeError

def split_large_json(input_file, output_prefix, chunk_size):
    with open(input_file, 'r') as f:
        # Initialize the JSON item iterator
        objects = ijson.items(f, 'item')

        chunk = []
        chunk_count = 0

        for obj in objects:
            chunk.append(obj)
            if len(chunk) == chunk_size:
                # Write the current chunk to a new file
                with open(f'{output_prefix}_{chunk_count + 1}.json', 'w') as chunk_file:
                    json.dump(chunk, chunk_file, indent=4, default=decimal_default)
                chunk = []
                chunk_count += 1

        # Write any remaining items in the last chunk
        if chunk:
            with open(f'{output_prefix}_{chunk_count + 1}.json', 'w') as chunk_file:
                json.dump(chunk, chunk_file, indent=4, default=decimal_default)

# Replace 'large_file.json' with the path to your large JSON file
input_file = 'MoNA-export-LipidBlast_2022-json (1)/MoNA-export-LipidBlast_2022.json'
# Prefix for the output chunk files
output_prefix = 'MoNA-export-LipidBlast_2022-split_100k/MoNA-export-LipidBlast_2022-split'
# Size of each chunk
chunk_size = 100000

# Split the large JSON file into smaller pieces
split_large_json(input_file, output_prefix, chunk_size)
jorainer commented 3 months ago

Thanks for adding this issue. I'm currently working on a solution to import/process large json files in a chunk-wise manner.

jorainer commented 3 months ago

Hi @pratarora , I've added a new parameter n to compound_tbl_lipidblast that you can use to define the number of lines of a (large) json file that should be loaded and processed at a time. This should help avoiding the memory problems you reported.

Thus, the code below (with n = 100000) should work on the full file (it worked on my computer).

fl <-("MoNA-export-LipidBlast_2022.json")
cmps <- compound_tbl_lipidblast(fl, n = 100000, verbose = TRUE)

For now, you would need to install this version of the package from github using BiocManager::install("RforMassSpectrometry/CompoundDb") (assuming you have the BiocManager package installed - which you can with install.packages("BiocManager").

pratarora commented 3 months ago

Hi Johannes

Thanks a lot for the quick fix.

I tried it, but it works partially. The exactmass column is converted to character while reading. Inchi and inchikey is not read, I can change exactmass manually to numeric and get the sqlite file.

cmps %>% str

tibble [1,346,798 × 7] (S3: tbl_df/tbl/data.frame)
 $ compound_id: chr [1:1346798] "LipidBlast2022_999999" "LipidBlast2022_999998" "LipidBlast2022_999997" "LipidBlast2022_999996" ...
 $ name       : chr [1:1346798] "HexCer 27:3;3O/15:0;(2OH)" "HexCer 27:3;3O/14:1;(2OH)" "HexCer 27:3;3O/14:0;(2OH)" "HexCer 27:3;3O/13:1;(2OH)" ...
 $ inchi      : chr [1:1346798] "" "" "" "" ...
 $ inchikey   : chr [1:1346798] NA NA NA NA ...
 $ formula    : chr [1:1346798] "C48H89NO10" "C47H85NO10" "C47H87NO10" "C46H83NO10" ...
 $ exactmass  : chr [1:1346798] "839.648648048" "823.6173479199999" "825.632997984" "809.601697856" ...
 $ synonyms   :List of 1346798
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  .. [list output truncated]
jorainer commented 3 months ago

oooooh, thanks for the report! I will check that. Should be numeric of course!

pratarora commented 3 months ago

I am working on it, I will post the code for it soon.

jorainer commented 3 months ago

should be fixed now. thanks for reporting!

pratarora commented 3 months ago

Hi Johannes,

I modified the code. I tried to make it a bit parallel as well. It can now read all the fields for both the old lipidblast (https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/70d4b8c0-4863-4632-ade4-0ace697bdaa1) and the 2022 one (https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/d0899a6d-15b7-46ad-a836-f07313465fb8). I will create a pull request if it seems ok for you.

.parse_lipidblast_json_element  <- function(x) {
    id <- x$id
    cmp <- x$compound[[1]]

    # Helper function to extract metadata from multiple sources
    get_metadata <- function(name, sources) {
        for (source in sources) {
            value <- unlist(lapply(source, function(z) {
                if (z$name == name) z$value
            }))
            if (length(value) > 0 && !all(value == "")) {
                return(value[1])
            }
        }
        return(NA_character_)
    }

    # Define metadata sources
    metadata_sources <- list(x$metaData, cmp$metaData)

    # Get names
    nms <- vapply(cmp$names, `[[`, "name", FUN.VALUE = "character")
    snms <- if (length(nms) > 1L) nms[-1L] else NA_character_

    # Extract metadata
    mass <- get_metadata("total exact mass", metadata_sources)
    mass <- as.numeric(mass)
    if (is.na(mass)) mass <- NA_real_

    frml <- get_metadata("molecular formula", metadata_sources)
    inchi <- get_metadata("InChI", metadata_sources)
    inchikey <- get_metadata("InChIKey", metadata_sources)
    smiles <- get_metadata("SMILES", metadata_sources)
    compound_class <- get_metadata("compound class", metadata_sources)
    ionization_mode <- get_metadata("ionization mode", metadata_sources)
    precursor_mz <- as.numeric(get_metadata("precursor m/z", metadata_sources))
    precursor_type <- get_metadata("precursor type", metadata_sources)
    retention_time <- as.numeric(get_metadata("retention time", metadata_sources))
    ccs <- as.numeric(get_metadata("ccs", metadata_sources))

    # Extract spectrum data
    spectrum <- x$spectrum
    if (is.null(spectrum)) spectrum <- NA_character_

    list(
        compound_id = id,
        name = nms[1],
        inchi = inchi,
        inchikey = inchikey,
        smiles = smiles,
        formula = frml,
        exactmass = mass,
        compound_class = compound_class,
        ionization_mode = ionization_mode,
        precursor_mz = precursor_mz,
        precursor_type = precursor_type,
        retention_time = retention_time,
        ccs = ccs,
        spectrum = spectrum,
        synonyms = list(snms)
    )
}
.import_lipidblast <- function(file, n = -1L, verbose = FALSE, BPPARAM = bpparam()) {
    if (n < 0) {
        lipidb <- jsonlite::read_json(file)
        if (verbose)
            message("Processing ", length(lipidb), " elements ...", appendLF = FALSE)

        # Use BiocParallel to process elements in parallel
        res <- bplapply(lipidb, .parse_lipidblast_json_element, BPPARAM = BPPARAM)

        if (verbose) message(" done.")
    } else {
        res <- .import_lipidblast_json_chunk(file, n = n, verbose = verbose, BPPARAM = BPPARAM)
    }
    dplyr::bind_rows(res)
}

.import_lipidblast_json_chunk <- function(x, n = 10000, verbose = FALSE, BPPARAM = bpparam()) {
    con <- file(x, open = "r")
    on.exit(close(con))
    res <- list()
    while (length(ls <- readLines(con, n = n, warn = FALSE))) {
        if (length(grep("^\\[", ls[1L])))
            ls <- ls[-1L]
        if (length(grep("^\\]", ls[length(ls)])))
            ls <- ls[-length(ls)]
        ls <- sub(",$", "", ls)
        if (length(ls)) {
            chunk_res <- bplapply(ls, function(z) {
                .parse_lipidblast_json_element(jsonlite::fromJSON(z, simplifyVector = FALSE))
            }, BPPARAM = BPPARAM)
            res <- c(res, chunk_res)
        }
        if (verbose)
            message("Processed ", length(ls), " elements")
    }
    res
}
compound_tbl_lipidblast <- function(file, collapse = character(), n = -1L,
                                    verbose = FALSE, BPPARAM = bpparam()) {
    .check_parameter_file(file)
    res <- .import_lipidblast(file, n = n, verbose = verbose, BPPARAM = BPPARAM)
    if (length(collapse)) {
        ## collapse elements from lists.
        res$synonyms <- vapply(res$synonyms, paste0, collapse = collapse[1L],
                                FUN.VALUE = "character")
    }
    res
}
pratarora commented 3 months ago

new lipidblast image

old lipidblast image

pratarora commented 3 months ago

Hi Johannes, I created the pull request: https://github.com/rformassspectrometry/CompoundDb/pull/117