richardmleggett / nextclip

Nextera Long Mate Pair analysis and processing tool
GNU General Public License v3.0

Numerous undetected adaptors in B,C,D cats #11

Open MichaelFokinNZ opened 10 years ago

MichaelFokinNZ commented 10 years ago

Hi Richard! I've decided to open a new issue to share more information about undetected adaptors. I am working with 300 bp MiSeq reads, and my pipeline is raw data -> nextclip -> fastq-mcf -> blastn. The last two steps check whether any adaptors are still present, and finally I check these cases manually in Geneious. The "A" files are almost free of junction adaptors: fewer than 30 per 1M reads remain (fastq-mcf), and I haven't inspected these in detail. The "B" and "C" files look worse :( fastq-mcf detects from hundreds up to 23k adaptors per 1M reads. Note that fastq-mcf can only detect adaptors at the start or end of a read, not inside the sequence, so the real number is higher. I've analysed some of these files in detail and found that:

  1. Only a few (dozens of) duplicated adaptors are left; all cases have a 1-nucleotide indel at the junction site.
  2. There are plenty of single adaptors with a 100% hit to the read, in both terminal and internal positions... :( I would say a few thousand per 1M reads, plus more partial adaptors shorter than 18 nucleotides.
  3. I have not analysed read pairs yet.

I have no experience with whether this could affect de novo assembly, but I will try to avoid using the B and C categories.
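Since fastq-mcf only reports adaptors at read ends, a direct scan of the NextClip output can show how many exact hits are terminal versus internal. The following is a minimal sketch, not part of NextClip or fastq-mcf. It assumes the 19-base junction adaptor half is `CTGTCTCTTATACACATCT` (please confirm this against your library prep), it only finds exact matches (so adaptors with indels or mismatches are missed), and the function names are hypothetical.

```python
import io

# ASSUMPTION: 19-base Nextera junction adaptor half; verify for your kit.
ADAPTOR = "CTGTCTCTTATACACATCT"
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(handle):
    """Yield (read_id, sequence) from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a blank line, treated as the end)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield header.strip()[1:].split()[0], seq


def classify_read(seq, adaptor=ADAPTOR):
    """Classify the first exact hit as 'terminal' or 'internal', else 'none'.

    The forward adaptor is checked before its reverse complement, and only
    the leftmost occurrence of each is considered.
    """
    for probe in (adaptor, reverse_complement(adaptor)):
        pos = seq.find(probe)
        if pos == -1:
            continue
        if pos == 0 or pos + len(probe) == len(seq):
            return "terminal"
        return "internal"
    return "none"


def count_adaptor_hits(handle, adaptor=ADAPTOR):
    """Count reads per class for one FASTQ stream."""
    counts = {"terminal": 0, "internal": 0, "none": 0}
    for _read_id, seq in read_fastq(handle):
        counts[classify_read(seq.upper(), adaptor)] += 1
    return counts
```

To use it on a real file, open one of the category B or C FASTQ outputs with `open(path)` and pass the handle to `count_adaptor_hits`. The internal count gives a lower bound on what an end-only trimmer would not report.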

richardmleggett commented 10 years ago

Hi,

Thanks for this.

Are you saying that there are category B & C cases where, after processing with NextClip, all 19 bases of the junction adaptor are still present? If so, could you send me a file of example reads (e.g. a few hundred reads) from before processing? Then I will try and work out what is going on…

For de novo assembly, we would use categories A, B and C, but leave out D.

Thanks, Richard
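One way to put together such an example file is to collect the IDs of reads that still carry the adaptor after processing and pull the matching records out of the raw FASTQ. This is only a sketch under assumptions: it assumes read IDs are unchanged between the raw input and the NextClip output (not verified here), that each record is exactly four lines, and the function names are hypothetical.

```python
import io


def fastq_records(handle):
    """Yield (read_id, record_text) for each 4-line FASTQ record."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0].strip():
            return  # end of file (or a blank line, treated as the end)
        # Read ID = header text after '@' up to the first whitespace.
        read_id = lines[0][1:].split()[0]
        yield read_id, "".join(lines)


def extract_reads(raw_handle, wanted_ids, out_handle, limit=300):
    """Copy up to `limit` raw records whose ID is in `wanted_ids`.

    Returns the number of records written.
    """
    written = 0
    for read_id, record in fastq_records(raw_handle):
        if read_id in wanted_ids:
            out_handle.write(record)
            written += 1
            if written >= limit:
                break
    return written
```

For paired data, the same set of IDs would need to be applied to both the R1 and R2 raw files so that the pairs stay in sync.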

MesutOezil commented 10 years ago

Hi all,

I have been using NextClip for a few different species' genomes and for several different mate interval ranges, but I have never come across cases like this, with undetected internal junction adaptors remaining.

For your information, the latest version of FastQC (released early this month; http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) can detect Nextera junction adaptors in reads and report them in the 'Adaptor Content' panel. This new function is very useful for evaluating mate-pair read properties and for confirming that adaptors have been removed after running NextClip. I hope this helps you too.

Best regards,

Shigehiro

richardmleggett commented 10 years ago

Thanks.

I meant to say previously: don't forget that you can adjust the match parameters with --strict_match.

Thanks, Richard

MichaelFokinNZ commented 10 years ago

Thanks guys! I will analyse my data more precisely (the new FastQC is awesome!) and provide you with some files.