multimeric / TidyMultiqc

Converts 'MultiQC' Reports into Tidy Data Frames
https://multimeric.github.io/TidyMultiqc/
GNU General Public License v3.0

New JSON format on MultiQC 1.20 #11

Open fgvieira opened 5 months ago

fgvieira commented 5 months ago

Does TidyMultiqc support the new JSON format introduced in MultiQC version 1.20 (multiqc/megaqc#519)?

multimeric commented 5 months ago

No, but thanks for the reminder. I'll take a look.

multimeric commented 5 months ago

It seems that the format is fundamentally the same. The only difference is that MultiQC now uses Plotly, which I guess has changed the plot data format. For example, .report_plot_data.qualimap_coverage_histogram.datasets is different between the two formats.

Likely I will just have to add some more plot parsers and possibly update the vignette. However, the default functionality works as is, especially if you don't want to extract plot data.
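To see how a plot's `datasets` entry differs between two reports, something like the following could help. It is only a sketch: `describe_plot_datasets` is a hypothetical helper (not part of TidyMultiqc), it expects a report that has already been parsed into a nested R list, and the two toy inputs are made up for illustration rather than copied from real MultiQC output.

```r
# Hypothetical helper: summarise the shape of one plot's `datasets` entry.
# `parsed` is a MultiQC report already parsed into a nested list.
describe_plot_datasets <- function(parsed, plot_id) {
  datasets <- parsed[["report_plot_data"]][[plot_id]][["datasets"]]
  if (is.null(datasets)) {
    stop("No datasets found for plot: ", plot_id)
  }
  data.frame(
    index = seq_along(datasets),
    n_children = lengths(datasets),
    is_named = vapply(datasets, function(d) !is.null(names(d)), logical(1))
  )
}

# Toy inputs (invented shapes, NOT real MultiQC output).
toy_a <- list(report_plot_data = list(
  some_plot = list(datasets = list(list(1, 2, 3)))
))
toy_b <- list(report_plot_data = list(
  some_plot = list(datasets = list(list(label = "x", lines = list(1, 2))))
))

shape_a <- describe_plot_datasets(toy_a, "some_plot")
shape_b <- describe_plot_datasets(toy_b, "some_plot")
```

Comparing `shape_a` and `shape_b` side by side shows whether each dataset is a bare array or a keyed object, which is the kind of difference a plot parser would have to handle.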

multimeric commented 5 months ago

Hi @fgvieira, can you please test if my branch works for your use case? You can test it using remotes::install_github("multimeric/TidyMultiqc", ref="multiqc_1.2").

fgvieira commented 5 months ago

Thanks for the super fast reply!

Parsing the general and raw sections seems to work fine:

df <- load_multiqc("multiqc_data.json",
  sections=c("general", "raw"),
  find_metadata = function(sample, parsed) {
    parsed[c(
      "config_creation_date",
      "config_version",
      "config_output_dir"
    )]
  }
)

But, when parsing plot:

df <- load_multiqc("multiqc_data.json",
  sections=c("general", "raw", "plot"),
  plots=list_plots("multiqc_data.json")$id,
  find_metadata = function(sample, parsed) {
    parsed[c(
      "config_creation_date",
      "config_version",
      "config_output_dir"
    )]
  }
)

I get an error that seems to be related to modules that were run multiple times.

Example data: multiqc_data.json.zip

multimeric commented 5 months ago

I really don't recommend that you try to grab all the data like this. The general data is generally more sensible than raw, so only use the latter if absolutely essential. In terms of plots, it's not trivial to implement parsers for each plot type, so I've only done so for a small subset. If there is a specific plot that you think you need for your analysis then feel free to open an issue about it. However, requesting all plot data doesn't really make sense to me. If you want all the data, or you want to answer a very specific question about the data, then I would suggest loading the JSON file yourself in R.
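For reference, loading the JSON directly might look roughly like the sketch below. It assumes the jsonlite package is installed. The temporary file is a tiny stand-in so the snippet runs on its own, and the plot id `some_plot` is made up. The only keys used are ones mentioned in this thread (`config_version`, `report_plot_data`).

```r
library(jsonlite)

# Tiny stand-in file so the sketch is self-contained;
# point `path` at a real multiqc_data.json instead.
path <- tempfile(fileext = ".json")
writeLines(
  '{"config_version": "1.20", "report_plot_data": {"some_plot": {"datasets": []}}}',
  path
)

# simplifyVector = FALSE keeps the raw nested-list structure,
# which is easier to explore than jsonlite's auto-simplified output.
parsed <- fromJSON(path, simplifyVector = FALSE)

version <- parsed$config_version
plot_ids <- names(parsed$report_plot_data)
```

From there, `str(parsed, max.level = 2)` gives a quick overview of what the report contains.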

apeltzer commented 1 month ago

Hm, some issues I found while trying that branch:

Caused by error in `parse_con()`:
! lexical error: invalid char in json text.
                  "mapped_failed_pct": NaN,                 "paired in
                     (right here) ------^
Run `rlang::last_trace()` to see where the error occurred.

(It's output from nf-core/rnaseq, so it's easily reproducible. The file is available at https://nf-co.re/rnaseq/3.14.0/results/rnaseq/results-b89fac32650aacc86fcda9ee77e00612a1d77066/aligner_star_salmon/multiqc/star_salmon/multiqc_report_data/)

I've been using the JSON file.

multimeric commented 1 month ago

That pipeline seems to be using an old MultiQC. 1.19 maybe? I can't tell. The error you're getting is the same as #6 and #9, which was only fixed in MultiQC 1.22. Also, that isn't related to this issue, because it wasn't affected by the 1.20 format.
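For anyone stuck with a report from an older MultiQC, one possible workaround is to replace the bare `NaN` tokens with `null` before parsing. This is only a sketch and not how MultiQC or TidyMultiqc handle it: `read_json_with_nan` is a hypothetical helper, it assumes jsonlite is installed, and the regex would also rewrite a `NaN` that happens to follow a colon or comma inside a string value.

```r
library(jsonlite)

# Hypothetical helper: make JSON containing bare NaN values parseable.
read_json_with_nan <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  # Replace NaN only where it appears in value position,
  # i.e. after a comma, colon or opening bracket.
  fixed <- gsub("([,:\\[]\\s*)NaN\\b", "\\1null", txt, perl = TRUE)
  fromJSON(fixed, simplifyVector = FALSE)
}

# Stand-in file reproducing the failing pattern from the error above.
path <- tempfile(fileext = ".json")
writeLines('{"mapped_failed_pct": NaN, "paired": 10}', path)

res <- read_json_with_nan(path)
```

The `NaN` entries come back as `NULL`, so convert them to `NA` yourself if you need them in a data frame.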

apeltzer commented 1 month ago

Thanks Michael - I'll make sure the multiqc module in rnaseq gets upgraded to 1.23 which should address the issue mentioned :)