Sequenced reads are too short

ylang0825 commented 1 year ago

Hi, I intended to enrich Saccharomyces cerevisiae in ZymoBIOMICS mock microbial community. Even channels are treated as the adaptive group by setting “—even” and odd channels as control. After running the repeat masking scripts, I built index for the masked genome of Saccharomyces cerevisiae. The command for uncalled running is “uncalled realtime SC_masked --port 8001 -t 8 --enrich --even -c 3 > SC_masked.paf “. In the experiment I could see the purple bar emerged, proving that UNCALLED was ejecting reads. However, when I analyzed data, I found the coverage against SC was lower in the adaptive group than control. After looking into the sequencing summary text, I found the key reason was that sequenced reads were too short. In the adaptive group, unblocked reads had mean length of 1277bp (median 1280bp), while sequenced reads (labelled signal positive) had mean length of 1844bp (median 1059). In the control group, sequenced reads had mean length of 7218bp (median 7116). It seems that sequenced reads are much shorter in the adaptive group than control. Could you give me some advice on this? Thank you very much.

skovaka commented 1 year ago

Hi,

Are you using the Zymo High Molecular Weight pre-extracted DNA, or a different Zymo product? Either way, I'd suggest measuring the read lengths of Saccharomyces and bacterial reads separately, as read lengths can differ by species and Saccharomyces is the least abundant. How are you identifying which reads were unblocked and which were kept? I recommend checking the UNCALLED PAF file for the "ej:" tag which labels ejected (unblocked) reads. I would also be interested to see how many unblocked reads align to Saccharomyces (i.e. what is the false negative rate)? If most Saccharomyces reads are ejected then it's possible UNCALLED isn't keeping up with sequencing.

Note that in the paper we depleted bacteria rather than enriching yeast. I do expect your experiment should work better than the results you got, but it enrichment is "harder" than depletion because UNCALLED must give up on mapping after a small number of chunks. For depletion the most likely errors would be falsely kept reads, while for you the most likely error is falsely ejected reads. I hope that helps!

Thanks, Sam

ylang0825 commented 1 year ago

Hi sam, Thanks very much for your quick reply! I looked into the PAF file and fould most of the reads were not mapped to the refenrence and had the label "ej:". This is expected because Saccharomyces is the least abundant. Then I changed reference to E. coli, but I also found few reads were mapped to reference. Even I provided the whole Zymo genome (~40M) as reference, most of the reads had the label 'ej:' which were not mapped to reference. I think the key reason was that UNCALLED failed to align reads to the reference index. I'm quite confused since UNCALLED did not report any error.

skovaka commented 1 year ago

What type of flowcell are you using? UNCALLED only supports r9.4.1, which is now a "legacy" kit. That was sort of buried in the README, so I just added a more prominent notice at the top. I hope that didn't cause any confusion.

If you are using r9.4 data, have you tried mapping previously sequenced reads to the full Zymo reference using uncalled map? You should expect around ~90% alignment rate. If that works well then it would be a realtime-specific issue, which is harder to debug.

Thanks, Sam

ylang0825 commented 1 year ago

I used r9.4.1 flowcells. I tried mapping previously sequenced fast5 to the full Zymo reference using uncalled map. That works as most of the reads could be mapped to reference. But the .paf file in the realtime sequencing had few reads mapped to reference. I checked that the same read could be mapped to reference using uncalled map but not mapped using uncalled realtime. Thanks!

skovaka / UNCALLED

Sequenced reads are too short #55