replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

LCS UCSC marker update feature #231

Closed MarieLataretu closed 2 years ago

MarieLataretu commented 2 years ago

Adds the functionality to generate an updated UCSC marker table via:

--lcs_ucsc_version       Create marker table based on a specific UCSC SARS-CoV-2 tree (e.g. '2022-05-01'). Use 'predefined'
                         to use the marker table from the repo (most probably not up-to-date) [default: predefined]
                         See https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2 for available trees.
--lcs_ucsc_predefined    If '--lcs_ucsc_version predefined', select pre-calculated UCSC table [default: 2022-01-31]
                         See https://github.com/rki-mf1/LCS/tree/master/data/pre-generated-marker-tables
--lcs_ucsc_update        Use latest UCSC SARS-CoV-2 tree for marker table update. Overwrites --lcs_ucsc_version [default: false]
                         Automatically checks https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.version.txt
--lcs_ucsc_downsampling  Downsample sequences when updating marker table to save resources. Use 'None' to turn off [default: 10000]
                         Attention! Updating without downsampling needs a lot of resources in terms of memory and might fail.
                         Consider downsampling or increase the memory for this process.
--lcs_variant_groups     Provide path to custom variant groups table (TSV) for marker table update. Use 'default' for predefined groups from repo
                         (https://github.com/rki-mf1/LCS/blob/master/data/variant_groups.tsv) [default: default]
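To make the precedence between these options easier to follow, here is a minimal Python sketch of how they could be resolved. It is not the poreCov implementation (the workflow itself is written in Nextflow); `MarkerTablePlan`, `parse_downsampling` and `plan_marker_table` are hypothetical names, and the only rules modelled are the ones stated in the help text above: `--lcs_ucsc_update` takes priority over `--lcs_ucsc_version`, `'predefined'` selects a pre-calculated table, and `'None'` switches downsampling off.

```python
from typing import NamedTuple, Optional


class MarkerTablePlan(NamedTuple):
    mode: str                    # "predefined", "version" or "latest"
    date: Optional[str]          # table/tree date; None when it must be looked up
    downsampling: Optional[int]  # None means no downsampling


def parse_downsampling(value: str) -> Optional[int]:
    """Interpret --lcs_ucsc_downsampling; the string 'None' turns it off."""
    if value.strip() == "None":
        return None
    count = int(value)
    if count <= 0:
        raise ValueError("downsampling must be a positive integer or 'None'")
    return count


def plan_marker_table(ucsc_version: str = "predefined",
                      ucsc_predefined: str = "2022-01-31",
                      ucsc_update: bool = False,
                      downsampling: str = "10000") -> MarkerTablePlan:
    """Hypothetical helper: decide which marker table to use or build."""
    if ucsc_update:
        # --lcs_ucsc_update wins over --lcs_ucsc_version; the tree date is
        # only known after reading public-latest.version.txt.
        return MarkerTablePlan("latest", None, parse_downsampling(downsampling))
    if ucsc_version == "predefined":
        # A pre-calculated table is used as-is, so downsampling does not apply.
        return MarkerTablePlan("predefined", ucsc_predefined, None)
    # Build a table from the requested UCSC tree date.
    return MarkerTablePlan("version", ucsc_version, parse_downsampling(downsampling))
```

With the defaults, `plan_marker_table()` selects the pre-calculated 2022-01-31 table; passing `ucsc_update=True` ignores any `ucsc_version` value.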

Waiting for the nanozoo LCS container with an updated UShER version.

replikation commented 2 years ago

@MarieLataretu this one? nanozoo/lcs_sc2:1.1.0--3741450

hoelzer commented 2 years ago

@MarieLataretu this one? nanozoo/lcs_sc2:1.1.0--3741450

Yep, we bilaterally communicated that yesterday : )

And that's really cool now because we are able to update the reference set of marker mutations and don't need to rely on updates done by the original authors.

@MarieLataretu by default you use the lineage TSV from the original repo? Then it's important to add new lineages (e.g. BA.4, BA.5) and to update this list when generating a new marker mutation table index.

MarieLataretu commented 2 years ago

@MarieLataretu this one? nanozoo/lcs_sc2:1.1.0--3741450

Yes, will update it in a second!

@MarieLataretu by default you use the lineage TSV from the original repo? Then it's important to add new lineages (e.g. BA.4, BA.5) and to update this list when generating a new marker mutation table index.

Yes, the default is the predefined table (2022-01-31) from the original repo, generated with an old variant group table (https://github.com/rki-mf1/LCS/blob/4fd9bf2d976cfe9e1ba7ffe0e9b50d46945c91ef/data/variant_groups.tsv). I'm currently running an update with the new variant groups, no downsampling, and the UCSC tree from yesterday, but that takes some time.

In the LCS fork, I updated the variant group table by adding BA.4 and BA.5: https://github.com/rki-mf1/LCS/blob/master/data/variant_groups.tsv. Feel free to open a PR there; updated marker tables will then contain the changes. Or use --lcs_variant_groups for a custom file.

replikation commented 2 years ago

@MarieLataretu

WARN: Access to undefined parameter `lcs_cutoff` -- Initialise it to a default value eg. `params.lcs_cutoff = some_value`
replikation commented 2 years ago

I also get this message:

# command
./poreCov/poreCov.nf --fastq "1.Reads/*fastq.gz" -profile ukj_cloud --screen_reads 

# error message
[null] NOTE: Can't stage file file:///home/replikation/Desktop/tmp_test_porecov/default -- file does not exist -- Error is ignored

Not sure what it's trying to stage here.

This error does not appear on the current poreCov release candidate.

replikation commented 2 years ago
executor >  google-lifesciences (50)
[71/4f238a] process > read_qc_wf:nanoplot (4)                              [100%] 5 of 5 ✔
[b4/eda5b7] process > filter_fastq_by_length (4)                           [100%] 5 of 5 ✔
[skipped  ] process > read_classification_wf:download_database_kraken2     [100%] 1 of 1, stored: 1 ✔
[bf/6354c5] process > read_classification_wf:kraken2 (4)                   [100%] 4 of 4 ✔
[3d/7fe72f] process > read_classification_wf:krona (4)                     [100%] 4 of 4 ✔
[-        ] process > read_classification_wf:lcs_ucsc_markers_table        -
[-        ] process > read_classification_wf:lcs_sc2                       -

command:

./poreCov/poreCov.nf --fastq "1.Reads/*fastq.gz" -profile ukj_cloud --screen_reads
MarieLataretu commented 2 years ago

@MarieLataretu

WARN: Access to undefined parameter `lcs_cutoff` -- Initialise it to a default value eg. `params.lcs_cutoff = some_value`

This should be fixed now, as well as the staging problem. (The problem was my cloud-unfriendly optional input.)

Somehow the update with a custom variant group file (--lcs_variant_groups new_groups.tsv --lcs_ucsc_update) is segfaulting again with the current container. I'll debug that next week. (Previously I had segfaults with UShER version 0.5.0, but not with 0.4.0 or 0.5.4.)

replikation commented 2 years ago

i still have a

[94/10650c] NOTE: Process `create_summary_report_wf:summary_report_default (1)` terminated with an error exit status (1) -- Error is ignored

need to check out why this is happening

replikation commented 2 years ago

but lcs is running now

MarieLataretu commented 2 years ago

but lcs is running now

Nice!

I figured out the segfaults/failures: the input file for matUtils was empty because I had accidentally introduced whitespace in the custom variant group file. I made small changes in the LCS fork (added a strip() and a test for 0 samples) and added a new pre-generated marker table (date: 2022-05-15, variant groups: https://github.com/rki-mf1/LCS/blob/master/data/variant_groups.tsv), which is now the default.
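For illustration, the kind of guard described here could look roughly like the sketch below. This is not the code from the LCS fork: the function names are hypothetical, and the table layout (one tab-separated group name and lineage name per row) is an assumption made only for this example. It shows why a stray space makes the lineage lookup come back empty and how stripping plus an explicit zero-sample check surfaces the problem early.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, TextIO


def read_variant_groups(handle: TextIO) -> Dict[str, List[str]]:
    """Map group name -> lineage names, stripping stray whitespace.

    Assumed layout (hypothetical): group<TAB>lineage, one pair per row.
    """
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        cells = [cell.strip() for cell in row]
        if len(cells) < 2 or not cells[0] or not cells[1]:
            continue  # skip blank or incomplete rows
        groups[cells[0]].append(cells[1])
    return dict(groups)


def samples_for_group(lineages: Iterable[str],
                      sample_to_lineage: Dict[str, str]) -> List[str]:
    """Return the samples assigned to any of the lineages; fail on 0 samples."""
    wanted = set(lineages)
    samples = sorted(s for s, lineage in sample_to_lineage.items()
                     if lineage in wanted)
    if not samples:
        raise ValueError("no samples matched the variant group; "
                         "check the table for typos or whitespace")
    return samples


# The trailing space after 'BA.4' is removed while reading the table.
demo = io.StringIO("Omicron_BA.4\tBA.4 \n")
print(read_variant_groups(demo))
```

Raising an error when a group matches no samples means an empty selection is reported immediately instead of being passed on to a downstream tool.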

MarieLataretu commented 2 years ago

i still have a

[94/10650c] NOTE: Process `create_summary_report_wf:summary_report_default (1)` terminated with an error exit status (1) -- Error is ignored

need to check out why this is happening

Try it with --update. I think the report expects the output of pangolin version >= 4.0.0.

DataSpott commented 2 years ago

@replikation tested it now with one of our routine batches, 46 samples in total. The run was successful. The screen-reads process failed 9 times in total with exit code 14 (preemptible exit: the node was shut down by Google), but it eventually succeeded for all samples. So the code is working correctly. The HTML report is generated correctly and the results are equal to the original analysis.

Command: nextflow run ~/test_poreCov/poreCov/poreCov.nf -profile ukj_cloud --update --extended --primerV V1200 --rapid --minLength 150 --medaka_model r941_min_sup_g507 --screen_reads --fastq_pass ~/nano-server/GRIDION_DISK/20220422_covid_routine_batch80/20220422_covid_routine_batch80/20220422_1347_X3_FAR83359_2d4ca88d/fastq_pass/ --samples ~/nano-server/GRIDION_DISK/20220422_covid_routine_batch80/20220422_covid_routine_batch80.csv --output ~/test_poreCov/result
