iqbal-lab-org / viridian


Scheme detection fails depending on read order #90

Closed: bede closed this issue 2 years ago

bede commented 2 years ago

I've encountered some strange behaviour where Viridian throws a KeyError if reads that otherwise assemble fine are supplied in a different order. Attached is a small MRE using the latest docker image, including logs and the exact commands run. Presumably this behaviour might change in the forthcoming release, but I'd be very grateful if you could replicate and/or comment. These are simulated reads and may be fundamentally unsound; however, this behaviour is surprising as a user.

iqbal-lab commented 2 years ago

This is a very interesting bug @bede, thanks for raising it! Thanks also for providing an MRE. I don't see any file showing what the KeyError was or when it happened? Would it be possible to add a readme explaining the contents of the deepak zip? It looks to me like A-08a84971 failed with no json output, but A-08a84971.8c3 passed/succeeded, is that right?

But D-f27c0b3e.085 seems to have failed while still producing a json, and that json ends with

"read_and_primer_stats": null, "read_sampling": null, "viridian": null, "self_qc": null

which means the run didn't even get as far as finishing processing the reads ready for assembly.
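
(As an aside for anyone triaging a similar run: a minimal sketch, assuming the json quoted above is the run log written to the output directory and that its filename is log.json, which lists the stages that never completed:)

import json

# Minimal sketch (path and filename assumed): list the pipeline stages whose
# entries in the run's json log are still null, i.e. were never reached.
def unfinished_stages(log_path):
    with open(log_path) as f:
        log = json.load(f)
    stages = ["read_and_primer_stats", "read_sampling", "viridian", "self_qc"]
    return [s for s in stages if log.get(s) is None]

print(unfinished_stages("/data/D-f27c0b3e.085/log.json"))  # hypothetical path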

Are you providing 2 examples, each with 2 attempts using reads in different orders (which you have deliberately reordered?), where one attempt failed with no output and the other produced a json? Which one had the KeyError, and what was the exact output?

We'll take a look, but @martinghunt is away on holiday for the next week, so it may be a little while.

bede commented 2 years ago

Thanks Zam, and sorry I didn't provide the KeyError. I'm afraid there is a defect with my example anyway which makes it far less interesting: the reads are valid pairs, but the L and R reads appear to be duplicated, so it's not surprising that viridian encounters issues. Feel free to close for now while I investigate.

iqbal-lab commented 2 years ago

Ah, well, if the L and R reads are duplicated, I expect viridian might say "these aren't behaving right for any of our amplicons", but I wouldn't expect a KeyError; it would just say there was "no good data". OK, closing; feel free to re-raise.

bede commented 2 years ago

Agree with your comments – here's that KeyError for reference:

(base) bede@yoshina ~ % docker run -v /Users/bede/Desktop/deepak-testcase-partial-viridian-repro:/data ghcr.io/iqbal-lab-org/viridian_workflow:latest viridian_workflow run_one_sample --tech illumina --ref_fasta /data/MN908947v3.fasta --reads1 /data/D-f27c0b3e/085c811a-587d-5f6f-4977-130026e90eb2.reads_1.fastq --reads2 /data/D-f27c0b3e/085c811a-587d-5f6f-4977-130026e90eb2.reads_2.fastq --outdir /data/D-f27c0b3e.085
[2022-07-22T13:35:06 viridian_workflow INFO] Start running viridian_workflow, output dir: /data/D-f27c0b3e.085
[2022-07-22T13:35:06 viridian_workflow INFO] No primer schemes provided. Using all built in schemes
[2022-07-22T13:35:06 viridian_workflow INFO] Processed amplicon scheme files. Amplicon scheme names: COVID-AMPLISEQ-V1,COVID-ARTIC-V3,COVID-ARTIC-V4,COVID-MIDNIGHT-1200
[2022-07-22T13:35:06 viridian_workflow INFO] Mapping reads to reference
[M::mm_idx_gen::0.008*0.54] collected minimizers
[M::mm_idx_gen::0.010*0.61] sorted minimizers
[M::main::0.010*0.62] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.010*0.62] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.010*0.62] distinct minimizers: 4989 (99.98% are singletons); average occurrences: 1.002; average spacing: 5.979; total length: 29903
[M::worker_pipeline::11.662*0.30] mapped 99000 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: minimap2 -R @RG\tID:1\tSM:sample -t 1 -a -x sr /data/MN908947v3.fasta /data/D-f27c0b3e/085c811a-587d-5f6f-4977-130026e90eb2.reads_1.fastq /data/D-f27c0b3e/085c811a-587d-5f6f-4977-130026e90eb2.reads_2.fastq
[M::main] Real time: 11.674 sec; CPU: 3.533 sec; Peak RSS: 0.106 GB
[2022-07-22T13:35:18 viridian_workflow INFO] Detecting amplicon scheme and gathering read statistics
Traceback (most recent call last):
  File "/usr/local/bin/viridian_workflow", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/viridian_workflow/__main__.py", line 215, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/viridian_workflow/tasks/run_one_sample.py", line 13, in run
    one_sample_pipeline.run_one_sample(
  File "/usr/local/lib/python3.8/dist-packages/viridian_workflow/one_sample_pipeline.py", line 447, in run_one_sample
    pipeline.run()
  File "/usr/local/lib/python3.8/dist-packages/viridian_workflow/one_sample_pipeline.py", line 418, in run
    self.initial_read_map_and_detect_amplicon_scheme()
  File "/usr/local/lib/python3.8/dist-packages/viridian_workflow/one_sample_pipeline.py", line 191, in initial_read_map_and_detect_amplicon_scheme
    self.amplicon_tsv = self.amplicon_scheme_name_to_tsv[chosen_scheme]
KeyError: None
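
(For readers hitting the same traceback: the failure is just Python's normal behaviour when a dict is indexed with None. An illustrative sketch of the pattern, not viridian's actual code:)

# Illustrative sketch only, not viridian's code: scheme detection returning
# None, followed by a direct dict lookup, produces exactly "KeyError: None".
amplicon_scheme_name_to_tsv = {
    "COVID-ARTIC-V4": "covid-artic-v4.tsv",
    "COVID-MIDNIGHT-1200": "covid-midnight-1200.tsv",
}

def detect_scheme(read_stats):
    return None  # stand-in for detection finding no matching scheme

chosen_scheme = detect_scheme({})
amplicon_tsv = amplicon_scheme_name_to_tsv[chosen_scheme]  # raises KeyError: None
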
jeff-k commented 2 years ago

That's super helpful, thanks! This was a regrettably poor error message for the case where amplicon scheme detection fails in v0.3.7.

The newer images throw an exception with a proper explanation: https://github.com/jeff-k/viridian_workflow/blob/0317b31804075719b96a4fb17182a7217be06f4a/viridian_workflow/readstore.py#L203
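
(As a rough sketch of that improved behaviour, assuming it amounts to checking for a failed detection before indexing, rather than a quote of the linked code:)

# Sketch of the friendlier failure mode (assumed, not copied from readstore.py):
# refuse to continue with an explicit message instead of indexing with None.
amplicon_scheme_name_to_tsv = {"COVID-ARTIC-V4": "covid-artic-v4.tsv"}
chosen_scheme = None  # what scheme detection returned in the failing run

if chosen_scheme is None:
    raise RuntimeError(
        "Could not identify the amplicon scheme from the mapped reads; "
        "supported schemes: " + ", ".join(sorted(amplicon_scheme_name_to_tsv))
    )
amplicon_tsv = amplicon_scheme_name_to_tsv[chosen_scheme]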

iqbal-lab commented 2 years ago

So...why was it read order dependent??

jeff-k commented 2 years ago

I don't think the issue was with the read order. In the failed case the R1 and R2 reads were the same, which would cause the fragments to fail amplicon identification.
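
(A quick way to sanity-check an input for this, sketched with the standard library and assuming plain uncompressed FASTQ, is to compare the sequence lines of the two files:)

# Rough sanity check (assumes plain, uncompressed FASTQ): if every R2 sequence
# equals its R1 mate, the "pairs" are duplicates rather than true mates.
def fastq_seqs(path):
    with open(path) as f:
        return f.read().splitlines()[1::4]  # sequence is line 2 of each 4-line record

r1 = fastq_seqs("085c811a-587d-5f6f-4977-130026e90eb2.reads_1.fastq")
r2 = fastq_seqs("085c811a-587d-5f6f-4977-130026e90eb2.reads_2.fastq")
dupes = sum(a == b for a, b in zip(r1, r2))
print(f"{dupes} of {len(r1)} pairs have identical R1/R2 sequences")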

bede commented 2 years ago

Jeff is spot on. This has turned into a KeyError report rather than a viridian-behaves-weirdly-due-to-read-order report. Sorry for misleading, and thanks!

bede commented 1 year ago

For the record, uploading an empty fastq reproduces this KeyError in 0.3.7 with docker and in gpas-dev.
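
(A minimal reproduction sketch along those lines, assuming docker is available, the MN908947v3.fasta reference is in the current directory, and the same 0.3.7 image and options as the command above:)

# Minimal reproduction sketch: empty FASTQs fed to the same command as above
# (image, reference path and mount point assumed from the earlier example).
import pathlib
import subprocess

pathlib.Path("empty_1.fastq").touch()
pathlib.Path("empty_2.fastq").touch()
subprocess.run(
    [
        "docker", "run", "-v", f"{pathlib.Path.cwd()}:/data",
        "ghcr.io/iqbal-lab-org/viridian_workflow:latest",
        "viridian_workflow", "run_one_sample",
        "--tech", "illumina",
        "--ref_fasta", "/data/MN908947v3.fasta",
        "--reads1", "/data/empty_1.fastq",
        "--reads2", "/data/empty_2.fastq",
        "--outdir", "/data/empty-out",
    ],
    check=False,  # expected to end in the KeyError on 0.3.7
)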