nanoporetech / pychopper

A tool to identify, orient, trim and rescue full length cDNA reads

Can we use pychopper to trim all the reads including non-full-length reads? #62

Open lauraht opened 2 years ago

lauraht commented 2 years ago

Hello!

I have cDNA and direct RNA Nanopore reads which contain poly-A/poly-T sequences and adapters/barcodes.

A friend suggested that I use pychopper to remove those poly-A/poly-T sequences and adapters/barcodes.

I see that pychopper is a tool to identify, orient, and trim full-length cDNA reads. What I am looking for is to remove poly-A/poly-T sequences and adapters/barcodes for all the reads that contain them (including both full-length and non-full-length).

So I have the following questions and would appreciate your advice:

(1) Can I use pychopper to remove poly-A/poly-T sequences and adapters/barcodes for all the reads that contain them, including non-full-length reads?

(2) Is there an option in pychopper that only does trimming without orienting? I only want to trim the reads.

(3) Can pychopper remove poly-A/poly-T sequences and adapters/barcodes for direct RNA reads as well?

Thank you very much!

lauraht commented 2 years ago

Hello!

I look forward to hearing from you, and I would greatly appreciate your advice and suggestions.

Thank you very much!

callumparr commented 2 years ago

For cDNA, you could use the tool primer-chop. It does a similar job to pychopper for stranded libraries and also trims the poly-A tails. You can supply your own primers together with a suitable length of As. Pychopper only trims reads that it classifies as full-length: if it cannot find the adapters it needs to orient a read, it cannot trim that read either.
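To make the idea of poly-A/poly-T trimming concrete, here is a minimal Python sketch. It is not how pychopper or primer-chop work internally; the function name `trim_poly_tails` and the `min_len` threshold are made up for illustration, and it assumes adapters/primers have already been removed so that the homopolymer run sits at the very end of the read.

```python
import re


def trim_poly_tails(seq: str, min_len: int = 10) -> str:
    """Strip a leading poly-T run and a trailing poly-A run.

    Hypothetical helper: only exact runs of at least `min_len` bases
    that touch the read ends are removed; no mismatches are tolerated.
    """
    seq = re.sub(rf"^T{{{min_len},}}", "", seq)   # 5' poly-T (reverse-strand cDNA)
    seq = re.sub(rf"A{{{min_len},}}$", "", seq)   # 3' poly-A (forward-strand cDNA)
    return seq


if __name__ == "__main__":
    print(trim_poly_tails("ACGTACGT" + "A" * 15))
```

Real reads contain sequencing errors inside the tail, so an exact-match rule like this will miss many tails; dedicated tools use error-tolerant alignment of the primers and tail instead.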

For direct RNA, you can consider your library already stranded: there is only one direction, because it is the original molecule that is sequenced. Trimming the poly-A tail is more complicated. The entire read is run through the RNA basecaller, which is trained on RNA and not DNA, so the raw signal coming from the DNA-based nanopore adapter disrupts the calling of the poly-A tail. You'll often see something more like AGAGAGAGA..... rather than AAAAAAAA. As with DNA, guppy does attempt to remove most of the nanopore sequencing adapter sequence when you use '--trim_strategy "rna"'. This should be set by default when you use the RNA config files in your guppy command. Currently, calling the length of the poly-A tail requires the raw signal data, using nanopolish or tailfindr.
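As a rough illustration of the AGAGAGA... pattern described above, the following Python sketch measures how long the A/G-only stretch at the 3' end of a basecalled read is. This is only a crude inspection aid under the assumption that the noisy tail consists purely of A and G calls; the name `ag_tail_length` is hypothetical, and it is not a substitute for signal-level tools such as nanopolish or tailfindr.

```python
def ag_tail_length(seq: str) -> int:
    """Return the length of the trailing stretch made up only of A and G.

    Hypothetical helper for eyeballing reads; it does not estimate the
    true poly-A tail length, which requires the raw signal.
    """
    n = 0
    for base in reversed(seq.upper()):
        if base not in "AG":
            break
        n += 1
    return n


if __name__ == "__main__":
    read = "CCTTCCTT" + "AGAGAGAGAA"
    print(ag_tail_length(read))
```

Because genuine transcript sequence can also end in A or G, the returned length can overshoot the tail; it is meant for inspection, not for trimming decisions.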

From what I have seen, all the papers just take the guppy output, filter by Q-score and possibly also by length, and then align to the genome/transcriptome. Any remaining adapters and poly-A (AGGAGAGAGAG....) will be soft-clipped in your alignments.
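The filter-then-align approach can be sketched in plain Python as follows. This is a sketch under stated assumptions, not any paper's pipeline: the function names are hypothetical, the `min_q` and `min_len` thresholds are arbitrary placeholders, qualities are assumed to be Phred+33 encoded, and the mean Q-score is computed by averaging error probabilities rather than the Phred values themselves. The last function shows what "soft-clipped" means in practice by cutting the `S` segments of a CIGAR string off the read sequence.

```python
import math
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def mean_qscore(quality: str) -> float:
    """Mean read quality from a Phred+33 string, averaged in probability space."""
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(c) - 33) / 10) for c in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filter(seq: str, quality: str,
                  min_q: float = 7.0, min_len: int = 200) -> bool:
    """Keep reads that are long enough and of sufficient mean quality.

    Both thresholds are placeholder values chosen for illustration.
    """
    return len(seq) >= min_len and mean_qscore(quality) >= min_q


def strip_soft_clips(seq: str, cigar: str) -> str:
    """Return only the aligned part of a read, dropping soft-clipped ends."""
    # Hard-clipped bases (H) are not present in the stored sequence, so ignore them.
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar) if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return seq[left:len(seq) - right]


if __name__ == "__main__":
    print(strip_soft_clips("AAACCCCCGG", "3S5M2S"))
```

In other words, the aligner does the "trimming" for you: unaligned adapter and tail bases stay in the record but are marked as soft clips, so downstream analysis of the aligned portion is unaffected.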