Closed charlesdavid closed 7 years ago
Hi Charles, thank you for writing.
Please, go to google group discussion page and follow the "wobble base in restriction site" discussion.
https://groups.google.com/forum/#!topic/gbs-snp-crop/MMebvO1OGlU
Thank you.
On Wed, Dec 21, 2016 at 2:53 AM, Charles David notifications@github.com wrote:
Hello, just wondering of you could point me in the right direction here: Trying to analyze a single enzyme experiment using apeKI, so I would like to specify the correct sequence for parsing. Problem is, I am getting more than half the reads with no identifiable restriction site. I am using -enz1 CAG -enz2 CTG as parameters. So specifically, should I state only -enz1 CWG (using W as either A or T)? Will SNP-CROP process a command line with only one enzyme specified without crashing due to uninitialized $enz2? And am I using the correct restriction sequence? Seems to me that the code is looking for matches to both restriction sequences and reporting a ~50% failure rate.... Here is a summary:
SUMMARY: Total raw reads = 218,152,893 Reads with no identifiable restriction site = 123,456,905 Reads with no identifiable barcode = 13,236,709 Usable reads = 81,459,279 Percentage of usable reads = 37.34
Thanks for any clarification on this issue.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/halelab/GBS-SNP-CROP/issues/10, or mute the thread https://github.com/notifications/unsubscribe-auth/ANeR81sYTZGKxmfA0pUjdbKUdwHrkjrRks5rKLDYgaJpZM4LSkyf .
--
Arthur T. de O. Melo Postdoctoral Research Assistant College of Life Sciences and Agriculture | University of New Hampshire http://www.unh.edu/halelab http://lattes.cnpq.br/5898950331344200
@arthurmelobio Thanks for your response.
However, there are still some issues:
First, the script edits suggested in the referenced post are not possible, as line 173 is blank and the variable $R1seq
does not exist. I am using version 2 SNP-CROP scripts, and this edit only applies to version 1. Thus it looks as though there are only two options here for this analysis pipeline:
1) Hack the script to remove the getOption #enz2 by commenting out line 26 ; supply a "neutral string" to prevent code breakage on command line; and run twice with each of the two restriction site sequences, concatenating the results.
2) Do as in 1) above but leave $enz2 and supply a "dummy" string for $enz2
3) Figure out how to hack the version 2 perl script (PLEASE SPECIFY)
I personally do not like these options. They are very 'hacky' and will be hard to justify in a best practices document. Why not generalize your code so that it would be more able to handle the wide range of GBS experiments? For example, by making $enz2 an option instead of a requirement? Also, why not allow wobble bases, or at the least, more than one search string to be specified?
Thanks, Charles
Hi Charles,
I agree with you these suggested changes are not good. Professor Iago and me had thought before about something like your suggestions ... I'll work on the code trying to solve this issue.
I'll let you know as soon as I appropriately change the code.
Best, Arthur
Hi Charles, thanks for pushing us on this issue.
I don’t think you want to turn off enzyme2, because your GBS fragment should indeed have the ApeKI residue on both ends. Requiring the presence of that residue is good practice, in terms of basic quality control.
The only complication you have here is that, due to the wobble base, you have four possible fragments in your data:
The easiest thing to do would be to run Script 1 four times, once for each of the four scenarios above, and then concatenate the results before moving forward. For example, the four scenarios for Script 1 would be:
This should harvest all your reads. Please let us know how this goes, and thank you again, Iago
Quick clarification: Those four fragments scenarios I mentioned above were written out for one strand only. i.e. CAGC.......GCTG really means 5'-CAGC.......GCTG-3'. Hence the need to specify the reverse complement when parameterizing -enz2.
One final comment: The long-term solution to this will be to allow the specification of multiple enzymes and/or wobble bases, as you suggest. This will take us a little while; so for the time being, I suggest the solution above, which requires multiple runs but no hacking.
Hello, just wondering of you could point me in the right direction here: Trying to analyze a single enzyme experiment using apeKI, so I would like to specify the correct sequence for parsing. Problem is, I am getting more than half the reads with no identifiable restriction site. I am using
-enz1 CAG -enz2 CTG
as parameters. So specifically, should I state only-enz1 CWG
(using W as either A or T)? Will SNP-CROP process a command line with only one enzyme specified without crashing due to uninitialized$enz2
? And am I using the correct restriction sequence? Seems to me that the code is looking for matches to both restriction sequences and reporting a ~50% failure rate.... Here is a summary:SUMMARY: Total raw reads = 218,152,893 Reads with no identifiable restriction site = 123,456,905 Reads with no identifiable barcode = 13,236,709 Usable reads = 81,459,279 Percentage of usable reads = 37.34
Thanks for any clarification on this issue.