jts / ncov-tools

Small collection of tools for performing quality control on coronavirus sequencing data and genomes
MIT License

Lineage assignment of 'none' when processing many samples #85

Closed (Mach-2 closed 3 years ago)

Mach-2 commented 3 years ago

Hello, I've been having a problem lately where the run_name_summary_qc.tsv file is failing to populate with lineage assignments. The lineage, lineage_notes, and scorpio_call columns all contain values of 'none' for all samples.

If I run fewer samples, the output seems fine. If there are ~100 or more samples being analyzed, the columns show the 'none' value. The lineage assignments in the lineage/lineage_report.csv seem to be fine regardless of how many samples I'm running ncov-tools on.

Any help with this would be appreciated! Currently I'm just doing a bit of a workaround by stealing the lineages from the lineage_report.csv and sticking them into the run_name_summary_qc.tsv file before I generate the pdf report, but I'd definitely prefer a cleaner fix.
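
The manual workaround described above could be scripted along these lines. This is only a minimal sketch, not part of ncov-tools: it assumes the pangolin report identifies samples in a `taxon` column that exactly matches a `sample` column in the summary file, and that the report's `note` column corresponds to `lineage_notes`. The function names and the column mapping are hypothetical and may need adjusting to the actual files.

```python
import csv

# Assumed mapping: summary_qc.tsv column -> lineage_report.csv column.
COLUMN_MAP = {
    "lineage": "lineage",
    "lineage_notes": "note",
    "scorpio_call": "scorpio_call",
}


def load_lineages(report_path, key="taxon"):
    """Index the rows of the pangolin report by sample name (assumed column: taxon)."""
    with open(report_path, newline="") as handle:
        return {row[key]: row for row in csv.DictReader(handle)}


def patch_summary(summary_path, report_path, out_path, sample_col="sample"):
    """Copy lineage fields into a new summary table; return the number of rows matched."""
    lineages = load_lineages(report_path)
    patched = 0
    with open(summary_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            hit = lineages.get(row[sample_col])
            if hit is not None:
                for summary_col, report_col in COLUMN_MAP.items():
                    # Only overwrite when the report has a non-empty value.
                    if summary_col in row and hit.get(report_col):
                        row[summary_col] = hit[report_col]
                patched += 1
            writer.writerow(row)
    return patched
```

Writing to a separate output file keeps the original summary untouched, so the two can be compared before the PDF report is generated.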

Thanks! Madison

rdeborja commented 3 years ago

Can you tell me the version of ncov-tools and ncov-parser being used?

Mach-2 commented 3 years ago

Yep! ncov-tools is 1.7.1, and ncov-parser should be 0.6.7. We re-made the environment at 1:00pm today with mamba, straight from the environment.yml file.

rdeborja commented 3 years ago

I just re-created my environment and made sure to match yours with the latest version. I ran it on 158 samples and all three columns (i.e. lineage, lineage_notes, scorpio_call) correctly showed the values in the lineage_report.csv file. I ran it with usher and pangolearn as the pangolin inference engine and both populated the fields in the _summary_qc.tsv file correctly.

Were there any errors in the snakemake log file by chance or did everything complete successfully?

DarianHole commented 3 years ago

So this doesn't appear to be a sample number issue in itself, although the number of samples does seem to affect the final output for some reason. When I'm running ncov-tools on a lower number of samples (tested on 7 samples) I have no problem with the final result and see the lineage call as expected.

I tried again on a slightly larger dataset, 17 samples including the original 7, and I ran into the same issue with the lineages populating as none (my env was made last week Friday as well, I believe).

However, I do get proper lineages populating the table when I change the config value of completeness_threshold from what we normally have it set to, which is 0.5, to the default of 0.75.

So I'd hazard a guess it's somewhere in the completeness_threshold parameter, but the fact that the number of samples plays a role is odd.

rdeborja commented 3 years ago

@DarianHole just to confirm, the _lineage_report.csv file is currently populated with the expected lineage info. The none values only occur in the _summary_qc.tsv file, correct?

DarianHole commented 3 years ago

Correct! Sorry that was important info I should have included!

The lineage report from pangolin is as expected; the summary_qc file has the none values, which then leads to that being in the pdf output.

rdeborja commented 3 years ago

@DarianHole @Mach-2 which platform (i.e. Illumina, Oxford-nanopore) were you processing when this issue occurred?

DarianHole commented 3 years ago

I've seen it with both nanopore and the freebayes Illumina files as input. I could quickly check if the ivar Illumina files lead to a similar output though as I am sure I have a number I could quickly find.

Exact input for the nanopore:

data_root: files
amplicon_bed: amplicon.bed
primer_bed: nCoV-2019.bed
bed_type: unique_amplicons
offset: 0
reference_genome: nCoV-2019.reference.fasta
bam_pattern: "{data_root}/{sample}.sorted.bam"
consensus_pattern: "{data_root}/{sample}.consensus.fasta"
variants_pattern: "{data_root}/{sample}.pass.vcf.gz"
metadata: metadata.tsv
assign_lineages: true
platform: oxford-nanopore
completeness_threshold: 0.50

Command is just the all command for me:

snakemake -s workflow/Snakefile all

rdeborja commented 3 years ago

@DarianHole I was able to reproduce the issue with a Nanopore run. The problem is with ncov-parser and the lineage file parser. The lack of a scorpio call caused the note field to be populated with "Assigned from designation hash". I'm not sure if or how the number of samples affects scorpio calls, but I have since fixed ncov-parser under branch issue_85. Do you mind giving it a test?
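
For illustration only, here is a minimal sketch of a lineage report reader that tolerates a missing scorpio call. It is not the ncov-parser code or the actual fix on the issue_85 branch. It assumes the report is a CSV with `taxon`, `lineage`, `scorpio_call` and `note` columns, reads fields by header name, and substitutes a placeholder for empty values. The function name and the `missing` parameter are hypothetical.

```python
import csv
import io


def parse_lineage_report(handle, missing="none"):
    """Return {taxon: {...}} reading fields by header name, not by position."""
    result = {}
    for row in csv.DictReader(handle):
        result[row["taxon"]] = {
            # An empty or absent field falls back to the placeholder value.
            "lineage": row.get("lineage") or missing,
            "lineage_notes": row.get("note") or missing,
            "scorpio_call": row.get("scorpio_call") or missing,
        }
    return result


# A row with no scorpio call and a designation-hash note (made-up sample name).
example = io.StringIO(
    "taxon,lineage,scorpio_call,note\n"
    "S1,B.1.1.7,,Assigned from designation hash\n"
)
print(parse_lineage_report(example))
```

Reading by column name means an empty scorpio_call only affects that one field, while the lineage and note are still reported.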

DarianHole commented 3 years ago

Thanks for the quick work Richard! It seems like it may not have been a sample number issue but instead a certain type of sample whose output was messing with the final result (so more samples makes it more likely).

I can test it right now and report back pretty quickly as well.

DarianHole commented 3 years ago

Yep @rdeborja I can confirm that the changes to the ncov-parser fix the issue for my dataset! Thanks for the quick fix and help as always!

rdeborja commented 3 years ago

@DarianHole great, thanks for the quick turnaround. I've merged and created a new release (ncov-parser 0.6.8) on PyPI. @Mach-2 can you confirm this resolved your issue as well?

Mach-2 commented 3 years ago

Thanks for the quick solution Richard (and thanks for helping test things, Darian)! Darian and I were describing the same issue, so I think it's fair to assume that if the fix worked for his test data set then things are good to go now.