halelab / GBS-SNP-CROP

GBS SNP Calling Reference Optional Pipeline
GNU General Public License v2.0
31 stars 31 forks source link

Proper restriction enzyme sequence for single enzyme experiment using ApeKI #10

Closed charlesdavid closed 7 years ago

charlesdavid commented 7 years ago

Hello, just wondering of you could point me in the right direction here: Trying to analyze a single enzyme experiment using apeKI, so I would like to specify the correct sequence for parsing. Problem is, I am getting more than half the reads with no identifiable restriction site. I am using -enz1 CAG -enz2 CTG as parameters. So specifically, should I state only -enz1 CWG (using W as either A or T)? Will SNP-CROP process a command line with only one enzyme specified without crashing due to uninitialized $enz2? And am I using the correct restriction sequence? Seems to me that the code is looking for matches to both restriction sequences and reporting a ~50% failure rate.... Here is a summary:

SUMMARY: Total raw reads = 218,152,893 Reads with no identifiable restriction site = 123,456,905 Reads with no identifiable barcode = 13,236,709 Usable reads = 81,459,279 Percentage of usable reads = 37.34

Thanks for any clarification on this issue.

arthurmelobio commented 7 years ago

Hi Charles, thank you for writing.

Please, go to google group discussion page and follow the "wobble base in restriction site" discussion.

https://groups.google.com/forum/#!topic/gbs-snp-crop/MMebvO1OGlU

Thank you.

On Wed, Dec 21, 2016 at 2:53 AM, Charles David notifications@github.com wrote:

Hello, just wondering of you could point me in the right direction here: Trying to analyze a single enzyme experiment using apeKI, so I would like to specify the correct sequence for parsing. Problem is, I am getting more than half the reads with no identifiable restriction site. I am using -enz1 CAG -enz2 CTG as parameters. So specifically, should I state only -enz1 CWG (using W as either A or T)? Will SNP-CROP process a command line with only one enzyme specified without crashing due to uninitialized $enz2? And am I using the correct restriction sequence? Seems to me that the code is looking for matches to both restriction sequences and reporting a ~50% failure rate.... Here is a summary:

SUMMARY: Total raw reads = 218,152,893 Reads with no identifiable restriction site = 123,456,905 Reads with no identifiable barcode = 13,236,709 Usable reads = 81,459,279 Percentage of usable reads = 37.34

Thanks for any clarification on this issue.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/halelab/GBS-SNP-CROP/issues/10, or mute the thread https://github.com/notifications/unsubscribe-auth/ANeR81sYTZGKxmfA0pUjdbKUdwHrkjrRks5rKLDYgaJpZM4LSkyf .

--

Arthur T. de O. Melo Postdoctoral Research Assistant College of Life Sciences and Agriculture | University of New Hampshire http://www.unh.edu/halelab http://lattes.cnpq.br/5898950331344200

charlesdavid commented 7 years ago

@arthurmelobio Thanks for your response. However, there are still some issues: First, the script edits suggested in the referenced post are not possible, as line 173 is blank and the variable $R1seq does not exist. I am using version 2 SNP-CROP scripts, and this edit only applies to version 1. Thus it looks as though there are only two options here for this analysis pipeline:

1) Hack the script to remove the getOption #enz2 by commenting out line 26 ; supply a "neutral string" to prevent code breakage on command line; and run twice with each of the two restriction site sequences, concatenating the results.

2) Do as in 1) above but leave $enz2 and supply a "dummy" string for $enz2

3) Figure out how to hack the version 2 perl script (PLEASE SPECIFY)

I personally do not like these options. They are very 'hacky' and will be hard to justify in a best practices document. Why not generalize your code so that it would be more able to handle the wide range of GBS experiments? For example, by making $enz2 an option instead of a requirement? Also, why not allow wobble bases, or at the least, more than one search string to be specified?

Thanks, Charles

halelab commented 7 years ago

Hi Charles,

I agree with you these suggested changes are not good. Professor Iago and me had thought before about something like your suggestions ... I'll work on the code trying to solve this issue.

I'll let you know as soon as I appropriately change the code.

Best, Arthur

halelab commented 7 years ago

Hi Charles, thanks for pushing us on this issue.

I don’t think you want to turn off enzyme2, because your GBS fragment should indeed have the ApeKI residue on both ends. Requiring the presence of that residue is good practice, in terms of basic quality control.

The only complication you have here is that, due to the wobble base, you have four possible fragments in your data:

  1. CAGC………GCTG
  2. CAGC………GCAG
  3. CTGC………GCTG
  4. CTGC………GCAG

The easiest thing to do would be to run Script 1 four times, once for each of the four scenarios above, and then concatenate the results before moving forward. For example, the four scenarios for Script 1 would be:

  1. -enz1 CAGC -enz2 CAGC <— Note that the enz2 residue is the RC of what’s above for Scenario 1
  2. -enz1 CAGC -enz2 CTGC
  3. -enz1 CTGC -enz2 CAGC
  4. -enz1 CTGC -enz2 CTGC

This should harvest all your reads. Please let us know how this goes, and thank you again, Iago

halelab commented 7 years ago

Quick clarification: Those four fragments scenarios I mentioned above were written out for one strand only. i.e. CAGC.......GCTG really means 5'-CAGC.......GCTG-3'. Hence the need to specify the reverse complement when parameterizing -enz2.

halelab commented 7 years ago

One final comment: The long-term solution to this will be to allow the specification of multiple enzymes and/or wobble bases, as you suggest. This will take us a little while; so for the time being, I suggest the solution above, which requires multiple runs but no hacking.