rrwick / Porechop

adapter trimmer for Oxford Nanopore reads
GNU General Public License v3.0
323 stars 124 forks source link

Effectiveness of --min_trim_size #37

Open HenrivdGeest opened 6 years ago

HenrivdGeest commented 6 years ago

Hello,

We are trying to get a hand on the demultiplexing part. According to the manual the --min_trim_size is the minimal size of the alignment against an adapter. With the verbose 3 option I can see that the highlighted part in red changes if I modify this parameter. However the outputted sequences in bins do not change.

For example with

"-check_reads 100 --no_split **--min_trim_size 50** --end_size 75"
.....
  Barcode  Reads  Bases   File                       
  BC01         4   4,662  /project/method4/BC01.fasta
  BC02         4   3,971  /project/method4/BC02.fasta
  BC03         2   5,182  /project/method4/BC03.fasta
  BC04         5   5,951  /project/method4/BC04.fasta
  none        35  40,121  /project/method4/none.fasta

and with

"--check_reads 100 --no_split -**-min_trim_size 3** --end_size 75"

  Barcode  Reads  Bases   File                       
  BC01         4   4,546  /project/method4/BC01.fasta
  BC02         4   3,828  /project/method4/BC02.fasta
  BC03         2   5,046  /project/method4/BC03.fasta
  BC04         5   5,684  /project/method4/BC04.fasta
  none        35  39,534  /project/method4/none.fasta

With the verbose 3 option for example read # 43 get the following asignment: first Image --min_trim_size 3 --end_size 100 : (second image:-min_trim_size 50 --end_size 100)

image

As you can see, the highlighting in red (the barcode matching) changes but the final assignment does not. Is this expected behavior? If so, How do I set for example a minimal alignment length of 30bp on either ends of the sequence as a requirement for barcode matching and binning?

rrwick commented 6 years ago

Hi Henri,

The --min_trim_size option controls how long an alignment must be for the sequence to be trimmed, and you probably don't have to change it from the default of 4. This threshold exists because Porechop does semi-global overlap alignment for adapters at the ends of reads (specifically free end-gaps in both sequences, like this). Very short alignments are possible, and if the alignment is too short, it's likely to be spurious.

For example, if you were using ligation barcoding, the start-read adapters end with CAGCACCT. If there was no min_trim_size, then any read which started with a T would match and get a base trimmed off:

CAGCACCT
       TACGATCGACTACGTACG...

25% of reads would match, whether or not they have the adapter! That may not be a big problem (I've noticed that Trim Galore doesn't have a minimum trim), but by requiring an overlap of at least 4 (by default), we avoid so many false positive trims. Now only more confident alignments are trimmed:

CAGCACCT
    ACCTGCATCGAGCGATCGTCT...

In your example, --min_trim_size 50 is equivalent to saying 'I'm not willing to trim an adapter unless I see 50 or more aligned bases', and that's far too stringent. This prevented a lot of trimming, and is why --min_trim_size 50 resulted in more bases in your output. However, that setting doesn't affect the barcode scan, which is why you got the same reads in each output bin, regardless of the setting. I.e. the red highlighting in the verbose output shows which bases are trimmed, but non-trimmed bases can still be used for barcode assignment.

The --end_size setting is different - it controls how much of the read ends are used in alignments. The default (150) should be plenty - reducing this may make Porechop run faster, but reducing it too much will cause it to miss adapter alignments. You are of course more than welcome to play with the settings for the sake of experimentation, but I don't think you would need to change the defaults for either --min_trim_size or --end_size.

To address your final question, Porechop assigns barcodes by aligning the entire 24 bp barcode sequence to the read and getting the alignment score (specifically, it score the alignment where end-gaps are free in the read sequence but not free in the barcode sequence, see details [here]()). The default settings should be fine for this, but if you want to be stricter about the barcode binning, you can increase the --barcode_threshold setting (default of 75).

Phew - that's a lot of info! Does it make sense? Please let me know if I can elaborate on any part!

Ryan