marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
502 stars 126 forks source link

Write a user guide specifically for ONT data #742

Open rhpvorderman opened 7 months ago

rhpvorderman commented 7 months ago

ONT data does needs adapter chopping, quality filtering and length filtering. Adapter chopping for nanopore now references porechop, which is unmaintained as of 2018. Nanoflt is also recommended. I think cutadapt can fill all these roles, and also do a better job at it. A user guide highlighting these practices should be written. I am volunteering to write this guide when I have the time.

Points that should be included:

I think that a run through sequali and a run through cutadapt should be enough to ensure that reads are ready for further downstream processing.

It is important that I assign some priority to this. I just did a NCBI blast for CCTGTACTTCGTTCAGTTACGTATTGCT which is the nanopore ligation kit top strand adapter. I found 300+ hits in the ncbi database. Most of them had a 100% sequence identity across 26 of the 28 bases and all of the hits were sequenced using nanopore. The top hits were from 2022 and 2023. So even though nanopore sequencing has been around for quite some time, proper best practices do not seem to have registered with everyone. And no this is not just in some obscure plasmids from sewage water, it is in the entire naked molerat genome which has hits in several chromosomes.

rhpvorderman commented 6 months ago

I just had a LUMC-internal talk about using sequali and cutadapt for assessing nanopore data. There was also somebody from the Leiden Genome Technology Center who gave me more information about nanopore sequencing. Currently the default flow is to use chopper to cut off the ends of reads indiscriminately.

One reason the ends of reads are worse quality is that adapter sequences contain artificial bases that are not the typical Adenin, Cytosin, Guanin and Thymin. They might be listed as such but they are artifical. Another reason is that the motor protein is attached to the adapter and this is a single-stranded, not double stranded part (unlike the sequence of interest). This means that it will get through the poor much easier and much faster. Hence there will be a lot of deletions in the calling. Just chopping the ends of is therefore nice and fast as well as being just as correct. There is still plenty of DNA left after all.

Another thing to take care of is that adapters might link together creating a chimeric read that is joined by adapters. A good QC practice is to check these sort of sequences and split the reads that have this. This can be done by the sequencing center, and it is therefore important to request this.

The cutting can be solved with --cut 50 --cut -50 or similar. Chimeric reads can be detected by looking for adapter sequences and then discarding the read I guess. Alternately a chimeric read splitter could be added to cutadapt if this is a common enough problem. Chimeric reads are about 10% of the data on the latest ligation protocol if read splitting was not enabled at the sequencer.

diego-rt commented 6 months ago

Just want to add that I completely agree that it would be fantastic if cutadapt could incorporate ONT-centric options. Currently there is no tool that adequately deals with that. In particular, the features that are missing are:

  1. Chimeric read splitting. At present this is done in dorado but one might want to have more flexibility in the parameters, as well as double checking if it's really critical.
  2. Trimming of base modification tags (MM and ML) if they are present in the fastq header. Naively trimming reads without adjusting these tags basically results in corruption of the base modification information. Some additional discussion here
rhpvorderman commented 6 months ago

These are good suggestions. I have no experience yet with base modification tags, so I can't comment on that. Chimeric read splitting should be possible fiven that cutadapt already has an alignment algorithm implemented for adapter detection. Using --cut and relying on the adapter trimmer to detect chimeric reads is not a satisfactory solution. I will make a separate issue for that.

I made the issue here #747

Currently there is no tool that adequately deals with that.

I find this very disappointing about the nanopore ecosystem. It does not seem to have mature tools that adequately deal with the problems that it poses. Luckily cutadapt seems very well equipped to deal with the problems, it justs need a little tweaking for the full potential to come out.

diego-rt commented 6 months ago

Yes I fully agree. I just ran cutadapt (-g TTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA -a TGAAGCGGCGCACGAAAAACGCGAAAGCGTTTCACGATAAATGCGAAAACGGTTAA --error-rate 0.3 --overlap 30 -m 1000) on my ultra long data and after one round of 5' adapter trimming, running it again detected ~1% of reads still containing adapters so I guess that's the rate of chimeric reads that dorado still lets through.

I also find it crazy that there's no tools that can adequately deal with adapters after all these years. It's not even that hard of a problem when you are doing it already downstream of basecalling. But great to hear that someone is looking into it!

OZTaekOppa commented 6 months ago

Thank you for your excellent program and explanation. As I delved deeper into the details, I fortunately identified the issue.

If it's okay, I have a few questions for clarification.

FYI, I'm working with multiple human ONT Ultra-Long datasets (not dorado). According to information from the ONT team, the sequence for the Rapid Adapter (RA) used in the ultra-long sequencing kit (SQK-ULK114) is as follows: Rapid Adapter (RA) 5’-TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT-3’

Q1: What parameters (or example command) would be most effective in removing the RA sequences from the FASTQ? In a preliminary test on a single dataset, I used a stand-alone Cutadapt with three different options: "-a," "-g," and "-b." Each option resulted in a slightly different trimmed FASTQ file. Based on Cutadapt and the ONT team's guidance, it seems appropriate to select the regular 5' adapter (-g) for my ONT ultra-long datasets.

Q2: Should I consider setting insertion and deletion scores or options addressed in #745?

Q3: In reference to @diego-rt's suggestion, the command used both -g and -a options with complementary sequences. Q3-1: Why are the adapter sequences different for the ultra-long reads between mine and @diego-rt's? Q3-2: Should I take into account the --error-rate, --overlap, and -m options?

Any feedback would be greatly appreciated.

rhpvorderman commented 6 months ago

Q1. I recommend not using any of the -a -g -b options to remove adapters at ends. These bases have really low quality and a lot of insertions and deletions. As a result cutadapt's alignment method does not find all candidates. It is best to use --cut 60 --cut -60. This will remove 60 bp of each end indiscriminately. All adapters are cut. Since the sequences are supposedly long, this should not matter much for your results.
In addition I recommend using the -g option with the adapter and use the --discard option. If there are chimeric reads in your data they will have adapter sequence in the middle. The best way is to split them, but cutadapt has no machinery to do this. Since a chimeric read can consist of more than two reads, adapter trimming is also not an option. So the best way is to use --discard and simply remove chimeric reads from the dataset that way. A dataset can be between 2 and 10% chimeric reads depending on protocol. I recommend also finding the back adapter and use that with -a.

Q2: No it is not necessary. Cutadapt prefers mismatches over indels when resolving an alignment. For illumina this makes sense as indels are a 100 times less common. For nanopore indels are much more common but I believe less common than mismatches. I should investigate this a bit more. For now it seems that the defaults are fine.

Q3-1: That depends on the protocol used. I really can't comment on that. I recommend speaking to your sequencing center. Q3-2: Yes definitely. Set the error rate at 0.2. That allows 7 errors in your adapter. That should be enough to detect chimeric reads. 0.25 should also work without getting a massive false positive rate for this adapter of length 36. --overlap is only relevant for adapters at read extremities, these have already been taken care of with the --cut option, so setting --overlap 10 shoud be reasonable (default is 3, and that will have quite a lot of false positives on the read ends). Minimum length depends on what you want to do. I recommend --minimum-length 500. For length 500 and below, using illumina sequencing with a big insert size is much better. You could set it to 10_000 even. That will lower the execution times of assemblers and aligners considerably so that might be worth it.

EDIT: Oh yes, and please don't forget to set --max-aer to some error rate that you find acceptable.

I still want to talk to someone at the LGTC to solidify my usage guide, when the time is there I will make a pull request to the cutadapt documentation.

OZTaekOppa commented 5 months ago

Thank you for your prompt response. To provide additional clarification:

During re-basecalling of ONT Ultra-Long reads (via buttery-eel), I utilized the following settings for adapter removal: --detect_mid_strand_adapter --trim_adapters --detect_adapter --do_read_splitting --trim_strategy dna. Considering these settings, I assume that chimeric reads, involving both 3' and 5' ends, including mid-strand adapters, were considered and processed. The Rapid Adapter (RA) used in the ultra-long sequencing kit (SQK-ULK114) has the sequence: 5’-TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT-3’.

For cutadapt, I am considering the following approach: Example command 1: cutadapt --cut 60 --cut -60 -g TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT -a AGCAATACGTAACTGAACGAAGTACAGGAAAAAAAA --discard --error-rate 0.25 --overlap 10 -m 500 --max-aer

Example command 2: cutadapt --cut 60 --cut -60 -g TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT --discard --error-rate 0.25 --overlap 10 -m 500 --max-aer

I hope this command is designed to remove residual adapter sequences based on the provided information. The -g option targets the forward strand, and -a represents the reverse-complement sequence. If needed, I can modify the example commands (based on your recommendation).

One question that arose is whether I can use the -b option for this task, or if I should run -g and -a separately. Please advise if any modifications are necessary.

Your assistance on this matter is greatly appreciated.

rhpvorderman commented 5 months ago

--max-aer is missing an error rate in the example. You could set it to something like 0.1 or 0.15.

If the adapters and chimeric sequences are already removed by the basecaller there is really no need to use cutadapt for that purpose. Unless you find the adapter removal by the basecaller lacking. You could check if adapters are still present with sequali (disclaimer: still in beta, and I am the developer).

OZTaekOppa commented 5 months ago

Thank you for your response.

I'm interested in utilizing your program, sequali. Upon reviewing the "--adapter-file," I couldn't locate the "Rapid Adapter (RA) used in the ultra-long sequencing kit (SQK-ULK114)" with the sequence 5’-TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT-3’.

Do I need to append the required adapter sequences to the existing "--adapter-file," perhaps in the following format?

Rapid Adapter (RA) used in the ultra-long sequencing kit: SQK-ULK114

Oxford nanopore Rapid Adapter (RA) nanopore TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT start

Or should I create a new adapter sequence specifically for my task?

Any feedback would be really appreciated.

rhpvorderman commented 5 months ago

Hi. It is the same as the ligation kit adapter. That also has sequence TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT.

Please ask further questions regarding sequali on the issue tracker there, to keep this issue on topic.

diego-rt commented 5 months ago

Hey guys,

Sorry for my late reply. Just a heads up that for ultra long (and I assume also for other rapid adapter based library preps but I dont have any data to check myself) you need to trim not only the RA sequence, but also the transposase adapter sequence. You could use the sequence above that I derived from a MSA of the start of my reads or search in the porechop adapter list. Trimming the Rapid adapter sequence is not sufficient.

The adapter can only occur in the forward sense (i.e. at the start of the read). I checked the reverse complement just to get an idea of the false discovery rate with my alignment settings and ensure they were not too lax.

rhpvorderman commented 5 months ago

Thanks for the extra feedback. Much appreciated!

OZTaekOppa commented 5 months ago

Hi @diego-rt,

Appreciate your insights.

After experimenting, the current pipeline (Buttery-eel + Guppy or Dorado) seems effective for ONT data. For removing any remaining Rapid Adapter (RA) sequences, I employed Cutadapt with the -g option and Sequali in my Nextflow pipeline.

Now, I'm keen to explore and test the transposase adapter sequence. What would be the most suitable and reasonable approach to clean both RA and transposase adapter sequences post-Buttery-eel pipeline?

Possible Approaches:

1: Cutadapt (-g option only) 2: Porechop + Cutadapt (-g or -b options??) 3: An alternative combination?

Looking forward to your guidance and suggestions.

diego-rt commented 5 months ago

I would try pore chopfirst and cutadapt second. Since porechop can split reads based on internal adapters and therefore deal with chimeric reads. You can then use cutadapt for a second round of adapter trimming now that most chimeric reads have been split.

OZTaekOppa commented 5 months ago

Thank you for your insights.

I wanted to share that I've been exploring the Butterry-eel pipeline, integrated with Guppy and Dorado, and have found its features quite impressive. Its capabilities, especially parameters like --detect_mid_strand_adapter, --trim_adapters, --detect_adapter, and --do_read_splitting, provide comprehensive solutions for detecting, trimming, removing, and splitting both reads and adapters.

Given that the Butterry-eel pipeline seems efficient in trimming and splitting internal adapters, would it be advisable to transition directly to the cutadapt stage (without Porechop or Porechop_ABI)?

Looking forward to your thoughts on this.

rhpvorderman commented 4 months ago

Note to self: for further information about dorado and read splitting: https://github.com/nanoporetech/dorado/issues/510

Given that the Butterry-eel pipeline seems efficient in trimming and splitting internal adapters, would it be advisable to transition directly to the cutadapt stage (without Porechop or Porechop_ABI)?

@OZTaekOppa I assume that the buttery-eel pipeline will have done a good job as it will instruct the guppy basecaller to do the read splitting and adapter trimming. You could check the results with sequali if significant portions of the adapter are still present. If all the data is fine, cutadapt is technically not needed.

However since bioinformaticians often get already called data from sequencing centers and as such have no control over the base calling, I still think cutadapt can be a valuable tool. Especially when working with publicly available nanopore data. Also I want something that can double check whatever the base callers have been doing.

Tanglab-Jiang commented 3 months ago

As a friendly reminder, be sure to set --times 2 when using cutadapt if you want both the -g and -a options to take effect. Otherwise, only the first specified option (-g or -a) will be considered by cutadapt when trimming adapter sequences.

marcelm commented 3 months ago

Thank you for pointing out --times, but I have to correct this:

Otherwise, only the first specified option (-g or -a) will be considered by cutadapt when trimming adapter sequences.

Both adapters will be considered, but only the best matching adapter will be removed: https://cutadapt.readthedocs.io/en/stable/guide.html#multiple-adapters