google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License
229 stars 36 forks

Training tutorial? #27

Closed jelber2 closed 2 years ago

jelber2 commented 2 years ago

Curious if there will be a tutorial (sorry if I missed it somewhere in the repo) for training a custom DeepConsensus model for non-human PacBio HiFi reads. I tried DeepConsensus on bacterial PacBio HiFi/CCS reads, and, as expected, it does not perform as well as it does on human data.

danielecook commented 2 years ago

@jelber2 there is currently no tutorial for performing training with DeepConsensus.

We are planning on releasing this functionality in the future.

If you can provide more insight into the performance, it could be helpful to us. We are actively experimenting with training datasets.

jelber2 commented 2 years ago

Here is one example of the performance; note that I did not use the --all setting when running pbccs: https://github.com/PacificBiosciences/harmony/issues/1

AndrewCarroll commented 2 years ago

Hi @jelber2

This is an interesting observation, thank you for bringing it up. So far, we've run DeepConsensus (trained on human) on several non-human species (multiple plants, frog, and mouse) and have received reports from others on E. coli. In each case, we've still seen either direct evidence (gap-compressed identity) or indirect evidence (better assembly and YAK values) that DeepConsensus generalizes well to those other species.

Based on those observations, we're fairly optimistic that a single model should apply well across species. If there are counter-examples, it would be good to know in order to adjust our strategy for training.

Are you able to share the subreads files for these samples? It could be useful for us to replicate your findings in order to better understand them.

Thank you, Andrew

MariaNattestad commented 2 years ago

Looking at your other issue, I think you might get better results from DeepConsensus by following the DeepConsensus quick start more closely: use ccs --all, and make sure you are not manipulating the subreads in any way beyond what is explicitly stated in our quick start.
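For reference, the quick-start pipeline that comment points to looks roughly like the sketch below. The file names and the model checkpoint path are placeholders, and the exact flags can differ between DeepConsensus versions, so treat this as an outline rather than the canonical commands:

```shell
# Sketch of the DeepConsensus quick-start steps; subreads.bam and the
# checkpoint path are placeholders, not files from this issue.
SUBREADS=subreads.bam
CCS_BAM=ccs.bam
ALN_BAM=subreads_to_ccs.bam

# 1. Draft consensus for *all* ZMWs (no predicted-quality filtering).
CCS_CMD="ccs --all $SUBREADS $CCS_BAM"

# 2. Align the subreads back to their draft consensus with actc.
ACTC_CMD="actc $SUBREADS $CCS_BAM $ALN_BAM"

# 3. Polish the draft reads with DeepConsensus.
DC_CMD="deepconsensus run --subreads_to_ccs=$ALN_BAM --ccs_bam=$CCS_BAM --checkpoint=model/checkpoint --output=output.fastq"

# Execute each step only if the tool is installed; otherwise print the plan.
for cmd in "$CCS_CMD" "$ACTC_CMD" "$DC_CMD"; do
  tool=${cmd%% *}
  if command -v "$tool" >/dev/null 2>&1; then
    eval "$cmd"
  else
    echo "[skipped: $tool not installed] $cmd"
  fi
done
```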

jelber2 commented 2 years ago

@AndrewCarroll Unfortunately, I cannot share the subreads. The only thing I can think of is that these are E. coli reads from a PacBio Sequel (not a Sequel II). Here are the results of comparing --all to not-all, with and without DeepConsensus. [image: comparison plot]

jelber2 commented 2 years ago

I was able to get permission to share the data. @AndrewCarroll I sent you an email with links about a month ago, but I have not heard back on whether you were able to access the data.

AndrewCarroll commented 2 years ago

Hi @jelber2

I do see the email in my inbox when I search for your name. I am not sure how I missed it originally. I will download the data now and take a look.

jelber2 commented 2 years ago

@AndrewCarroll, feel free to post any updates from running these data through DeepConsensus here. My guess is that either I am doing something wrong, or the fact that the data came from a PacBio Sequel may be the issue.

jelber2 commented 2 years ago

So, I ran deepconsensus-0.3.1-gpu using pbccs-6.3.0 --min-rq=0.88, then ran harmony-0.2.0 (https://github.com/PacificBiosciences/harmony/releases/tag/v0.2.0) as before. One can certainly see an improvement in the deepconsensus-0.3.1-corrected reads (ccs.rq.deepcon-0.3.1) relative to the deepconsensus-0.2.0 reads (ccs.all.deepcon and ccs.notall.deepcon) and the plain ccs reads (ccs.all, ccs.notall, ccs.rq). Here, ccs.all uses pbccs-6.3.0 --all, ccs.notall uses pbccs-6.3.0 defaults, and ccs.rq uses pbccs-6.3.0 --min-rq=0.88.
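For readers following along, the three pbccs runs being compared can be written out explicitly. The subreads path is a placeholder, and pbccs defaults to --min-rq=0.99 when no flag is given; the commands only execute if ccs is actually on PATH:

```shell
# The three pbccs-6.3.0 runs compared above; subreads.bam is a placeholder.
MIN_RQ=0.88
if command -v ccs >/dev/null 2>&1; then
  ccs --all              subreads.bam ccs.all.bam     # ccs.all: keep every ZMW
  ccs                    subreads.bam ccs.notall.bam  # ccs.notall: defaults (--min-rq 0.99)
  ccs --min-rq="$MIN_RQ" subreads.bam ccs.rq.bam      # ccs.rq: relaxed quality cutoff
else
  echo "ccs (pbccs) not installed; commands shown for reference only"
fi
```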

[image: ccs-deepcon harmony plot]

AndrewCarroll commented 2 years ago

Hi @jelber2

I was able to run DeepConsensus v0.3.1 on your data, and everything seems to have run smoothly. Although I don't have the empirical quality calculations relative to the reference, I can use the predicted quality values from pbccs and DeepConsensus. These values are roughly in line with expectations from other datasets. I see a ~12% increase in the number of reads at >Q20 for DeepConsensus relative to pbccs.

This observation makes me wonder if one of the reasons you don't see more separation of the curves in your plot is that DeepConsensus is able to rescue many reads that would normally have <Q20 in pbccs (and therefore not be present in the notall dataset). As a result, the comparison is complicated by the fact that the DeepConsensus bins may have a larger number of more difficult reads.

I wonder if it might be more meaningful to plot the sequence yield at a given quality between the methods. For example, at a quality of Q30+, how many bases are present in the DeepConsensus dataset as compared to the pbccs dataset? This would help to disentangle the confounding factor of the difference in sequence yield between the methods.
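One way to do the yield-at-quality comparison described above: pbccs-style BAMs carry a per-read rq tag (predicted accuracy, i.e. 1 minus the expected error rate), and Q30 corresponds to rq >= 0.999. A small awk filter over SAM records can sum the bases from Q30+ reads; the function name here is my own, and whether your DeepConsensus output carries an rq tag depends on the output format, so check your BAM's tags first:

```shell
# Sum bases from reads whose predicted accuracy (rq tag) is at least Q30.
# Q30 means an expected error rate of 1e-3, i.e. rq >= 0.999.
q30_bases_from_sam() {  # reads SAM records on stdin, prints a base count
  awk '
    {
      rq = -1
      for (i = 12; i <= NF; i++)             # optional tags start at field 12
        if ($i ~ /^rq:f:/) rq = substr($i, 6) + 0
      if (rq >= 0.999) bases += length($10)  # field 10 is the read sequence
    }
    END { print bases + 0 }'
}

# Usage against a BAM (requires samtools):
#   samtools view ccs.bam | q30_bases_from_sam
```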

Alternatively, you could also explicitly filter the reads to the same read set between the different methods to get apples-to-apples comparisons on exactly the same data.

I am curious to hear your feedback on these potential strategies.

Thank you, Andrew

jelber2 commented 2 years ago
# get read names from the pbccs-6.3.0 not-all (default settings) run
samtools view -@34 ccs.notall.bam | cut -f 1 > names1

# filter the DeepConsensus reads down to that name set (filterbyname.sh is from BBTools)
filterbyname.sh include=t names=names1 in=ccs.rq.deepconsensus-0.3.11.bam out=STDOUT.sam | \
samtools view -Sb -@8 > ccs.rq.deepconsensus-0.3.11.notall.reads.bam

# run harmony on the name-filtered reads
harmony -j 34 ccs.rq.deepconsensus-0.3.11.notall.reads.bam ../flye/assembly.fasta ccs.rq.deepcon-0.3.1.na

# run harmony on the full read set
harmony -j 34 ccs.rq.deepconsensus-0.3.11.bam ../flye/assembly.fasta ccs.rq.deepcon-0.3.1

# make the plot
./single.R ccs.rq.deepcon-0.3.1 ccs.rq.deepcon-0.3.1.na

[image: ccs-deepcon harmony plot]