hepcat72 / CFF

Cluster-free Filtering. Determine which sequences are real in a metagenomic sample.
GNU General Public License v3.0

getReals.pl warning #2

Closed aparada14 closed 9 years ago

aparada14 commented 9 years ago

Hello, I am running CFF on a FASTA file and going through each of the scripts. When I run getReals.pl, I get the warning below, which I think is causing my all_926R_methods_checked_seqs.glib.reals file to be empty. I am not familiar with Perl regular expressions, so help in fixing this warning would be greatly appreciated. Thank you!

WARNING45: Unable to parse N0 from defline: [>lib_45;size=585;] in file: [out_directory_2/test.fna.lib.n0s.cands] using pattern: [N0=([^;]+);]. Please either fix the defline or use a different pattern (-l) to extract the N0 value.

I am using this command, getReals.pl -i out_directory_2/all_926R_methods_checked_seqs.fna.lib.n0s.cands -d 'out_directory_2/all_926R_methods_checked_seqs.fna.lib' -f out_directory_2/all_926R_methods_checked_seqs.glib -k 2

and the result of head -n 4 out_directory_2/all_926R_methods_checked_seqs.fna.lib.n0s.cands gives

lib_1;size=21938;N0=0;

TACGAAGGGACCTAGCGTAGTTCGGAATTACTGGGCTTAAAGAGCTCGTAGGTGGTTAAAAAAGTTGATGGTGAAATCCCAAGGCTCAACCTTGGAACTGCCATCAAAACTTTTTAGCTAGAGTGTGATAGAGGTAAGTGGAATTTCTAGTGTAGAGGTGAAATTCGTAGATATTAGAAAGAACACCAAATGCGAAGGCAACTTACTGGGTCACTACTGACACTGAGGAGCGAAAGCATGGGTAGCGAAGAGGATTAGATACCCTCGTAGTCCATGCCGTAAACGATGTGTGCTAGACGTTGGAAATATATTTTTCAGTGTCGCAGCGAAAGCATTAAGCACACCGCCTGGGGAGTACGACCGCAAGGTTA

lib_2;size=8049;N0=0;

TACGAAGGGGGCGAGCGTTATTCGGAATTATTGGGCGTAAAGGGCTCGCAGGCTGCTTGAACAGTTAGACGTGAAATCCCCGGGCTCAACCTGGGAACTGCGTTTAATACTAGCAAGCTAGAGAAATAGAGAGGAAAGTGGAACTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCAGTGGCGAAAGCGACTTTCTGGCTATTTTCTGACGCTGAGGAGCGAAAGCGTGGGGAGCAAACAGGGTTAGATACCCTGGTAGTCCACGCCGTAAACGATGTGTGCTAGATGTTGGAAGGTTACCTTTCAGTGTCGCAGCTAACGCACTAAGCACACCGCCTGGGAAGTACGGTCGCAAGATTA

hepcat72 commented 9 years ago

Hi,

I'd be happy to help. The pattern looks good for the file on which you called head. If my code for the warning is accurate, the file reported in the warning (out_directory_2/test.fna.lib.n0s.cands) appears to be missing N0 values, at least on that one defline (>lib_45;size=585;). Since it is warning number 45 and the sequence is lib_45, I'm guessing the previous 44 deflines are also missing N0 values, but the file you pasted does have N0 values.

If you'd like to send me the files, I can run them to see if I can reproduce your issue.

Cheers, Rob
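
As a rough illustration of why that defline triggers the warning, here is a minimal Python sketch that applies the pattern quoted in the warning message to the two deflines mentioned in this thread. It is not getReals.pl's actual code; the function name parse_n0 is hypothetical and Python's re module stands in for Perl's regex engine.

```python
import re

# Pattern quoted in the warning message; the capture group is the N0 value.
N0_PATTERN = re.compile(r"N0=([^;]+);")


def parse_n0(defline):
    """Return the N0 value as a string, or None if the defline has no N0 field.

    Hypothetical helper for illustration only.
    """
    match = N0_PATTERN.search(defline)
    return match.group(1) if match else None


# Defline with an N0 field, as in the pasted head output.
print(parse_n0(">lib_1;size=21938;N0=0;"))
# Defline from the warning, which has no N0 field.
print(parse_n0(">lib_45;size=585;"))
```

The first call finds an N0 field and the second does not, so a file whose deflines all look like the second one would presumably produce one such warning per sequence.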

aparada14 commented 9 years ago

Hi, yes, you are right. I get 50 errors of the same kind, and then it says:

WARNING50: NOTE: Further warnings of this type will be suppressed. WARNING50: Set --error-type-limit to 0 to turn off error suppression

Done. EXIT STATUS: [ERRORS: 0 WARNINGS: 104699 TIME: 17s] Scroll up to inspect full errors/warnings in-place.

I have attached the out directory I was using. I didn't include the original sequence file since it's large, but let me know if you require it. Thanks again for your help. cff_out_directory.tar.gz https://docs.google.com/file/d/0B5-H_vZle315cV9iZTdmdW8tUUk/edit?usp=drive_web

hepcat72 commented 9 years ago

Hi,

Alright, I see the problem. First, the warning is reporting the wrong file name. The file with the missing N0 values is out_directory_2/all_926R_methods_checked_seqs.fna.lib, supplied with -d. A new feature of getReals.pl, added recently by request, is the ability to produce a .n0smry file (similar to the .smry file, but with N0 values instead of counts). I had kept the -d option (renamed: -n or --n0-files) for backward compatibility, but apparently missed the fact that this would generate warnings (and imprecise ones, at that) with older pipelines. I will implement a fix today, but note that your output, except for the .n0smry file, is technically correct.

That said, there is a use case I did not anticipate and should be catching: I should definitely throw an error and improve the usage output to make this clearer. You indicated -k 2; however, this requires at least 2 files supplied to -i and 2 files supplied to -d. Think of these files as sample files. Each one can nominate a set of candidates. -k 2 means that a candidate must be nominated at least twice to be considered real. So here is how getReals should be called:

getReals.pl -i 'out_directory2/.lib.n0s.cands' -n 'out_directory2/.lib.n0s' -f out_directory_2/all_926R_methods_checked_seqs.glib -k 2

You can take a look at the example in the latest version of the tcsh scripts. Sorry for the confusing warnings! Thanks so much for the bug report.

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544
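
The -k rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (a candidate must be nominated by at least k sample files), not the logic inside getReals.pl; the function name call_reals and the in-memory lists of sequence IDs are assumptions made for the example.

```python
from collections import Counter


def call_reals(nominations_per_sample, k):
    """Return the IDs nominated by at least k samples (hypothetical helper).

    nominations_per_sample: one list of candidate IDs per sample file.
    """
    counts = Counter()
    for candidates in nominations_per_sample:
        # A sample can nominate a given candidate at most once.
        counts.update(set(candidates))
    return sorted(seq_id for seq_id, n in counts.items() if n >= k)


two_samples = [["lib_1", "lib_2"], ["lib_1", "lib_3"]]
print(call_reals(two_samples, 2))  # ['lib_1']

one_sample = [["lib_1", "lib_2"]]
print(call_reals(one_sample, 2))   # []
```

Under this reading, a single input file with -k 2 can never nominate anything twice, which would be consistent with the empty .reals file reported at the top of the thread.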

hepcat72 commented 9 years ago

This issue is resolved with getReals.pl version 1.14.

  1. Fixed file name reported in the warning message (a copy-and-paste-o)
  2. Added a check for validity of the -k value given the number of input files supplied.
  3. Improved the usage/help output regarding the -k and -i relationship.
  4. Added better backward compatibility support for -d.
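
For readers curious what the check in item 2 might look like, here is a minimal Python sketch. It is an assumption about the general shape of such a check, not the code added in version 1.14; the function name check_k and the error wording are hypothetical.

```python
def check_k(k, num_input_files):
    """Raise ValueError if k cannot be satisfied by the number of input files.

    Hypothetical helper: a candidate cannot be nominated k times unless
    at least k sample files are supplied.
    """
    if k > num_input_files:
        raise ValueError(
            f"-k {k} requires at least {k} input files, "
            f"but only {num_input_files} were supplied"
        )


check_k(2, 2)  # passes silently
try:
    check_k(2, 1)
except ValueError as err:
    print(err)
```

Failing early like this turns a silently empty result into an explicit message about the -k and -i relationship.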

Thanks aparada for bringing this one up!

Rob

aparada14 commented 9 years ago

Great, I will try it now, thanks for the quick help!

hepcat72 commented 9 years ago

No problem. Let me know if you have any more trouble.

Rob

aparada14 commented 9 years ago

Hi Robert,

I am running into a separate problem now, and I am not sure whether it is a problem with my samples. I am attempting to run CFF on a group of mock community samples. Some are even mock communities (made by amplifying 11 16S clones that had been combined at equal molar concentrations), others are staggered mock communities (27 clones at different relative abundances), and two are FASTA files that I made to represent either mock community by having the "perfect" clone sequences at the expected abundances.

I ran these, and the first issue I see is that when I look at the mereged_mock.glib.smry file, my samples are not named correctly anymore, i.e. my expected even sample looks like an amplified staggered community. Then, if I look at the counts per "otu", I am getting back 7 even clones instead of the 11 I put in. Any thoughts?

I am including the sequence files I am running the analysis on in case you'd like to try it out; maybe I'm doing something fundamentally wrong. I am not sure which of my output files may help you, so let me know and I can send it to you.

mock_communities.tar.gz https://docs.google.com/file/d/0B5-H_vZle315Rmo0V0JJMnowRWM/edit?usp=drive_web

Thanks in advance again, alma

hepcat72 commented 9 years ago

Before I get too deep into it, can you confirm that this is the same result you get?

I ran with the default of the shortest length: 307.

Rob

On Mar 31, 2015, at 2:04 PM, aparada14 notifications@github.com wrote:

​ mock_communities.tar.gz https://docs.google.com/file/d/0B5-H_vZle315Rmo0V0JJMnowRWM/edit?usp=drive_web ​ Hi Robert, I am running into a separate problem, now and not sure if this is a problem with my samples or not. I am attempting to run the cff on a group of mock community samples, some are even mock communities (made by amplifying 11 16S clones that had been combined at equal molar concentrations), others are staggered mock communities (27 clones at difference relative abundances), and then two fasta files that I made which represent the either mock community, by having the "perfect" clone sequences at the expected abundances. I ran these, and the first issue I see is that when I look at the mereged_mock.glib.smry file my samples are not named correctly anymore, i.e. my expected even sample looks like an amplified staggered community, and then if I look at the counts per "otu" I am getting back 7 even clones instead of the 11 I put in. Any thoughts? I am including the sequence files I am running the analysis on if you'd like to try it out, maybe I'm doing something fundamentally wrong. I am not sure which output file I have may help you, so let me know and I can send it to you. Thanks in advance again, alma

On Mon, Mar 30, 2015 at 2:39 PM, Robert Leach notifications@github.com wrote:

No problem. Let me know if you have any more trouble.

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Mar 30, 2015, at 5:23 PM, aparada14 notifications@github.com wrote:

Great, I will try it now, thanks for the quick help!

On Mon, Mar 30, 2015 at 8:27 AM, Robert Leach notifications@github.com wrote:

Hi,

Alright, I see the problem. First, the warning is reporting the wrong file name. The file with the missing N0 values is out_directory_2/all_926R_methods_checked_seqs.fna.lib, supplied with -d. A new feature of getReals.pl, added recently by request, is the ability to produce a .n0smry file (similar to the .smry file, but with N0 values instead of counts). I had kept the -d option (renamed: -n or --n0-files) for backward compatibility, but apparently missed the fact that this would generate warnings (and imprecise ones, at that) with older pipelines. I will implement a fix today, but note that your output, except for the .n0smry file, is technically correct.
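To make the missing-N0 condition concrete, here is a minimal Python sketch that applies the pattern quoted in the warning, `N0=([^;]+);`, to the two deflines mentioned in this thread. It is only an illustration, not the Perl code in getReals.pl; the function names `parse_n0` and `deflines_missing_n0` are hypothetical.

```python
import re

# Pattern copied from the warning text; everything else here is illustrative.
N0_PATTERN = re.compile(r"N0=([^;]+);")


def parse_n0(defline):
    """Return the N0 value found on a defline as a string, or None if absent."""
    match = N0_PATTERN.search(defline)
    return match.group(1) if match else None


def deflines_missing_n0(lines):
    """Return the deflines (lines starting with '>') that carry no N0 value."""
    return [
        line.strip()
        for line in lines
        if line.startswith(">") and parse_n0(line) is None
    ]
```

A defline such as `>lib_1;size=21938;N0=0;` yields the value `0`, while `>lib_45;size=585;` yields nothing, which is the situation the warning describes.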

That said, I should be catching a use-case I did not anticipate: I should definitely be throwing an error and improving the usage output to make this clearer. You indicated -k 2; however, this requires at least 2 files supplied to -i and 2 files supplied to -d. Think of these files as sample files. Each one can nominate a set of candidates. -k 2 means that a candidate must be nominated at least twice to be considered real. So here is how getReals should be called:

getReals.pl -i 'out_directory2/.lib.n0s.cands' -n 'out_directory2/.lib.n0s' -f out_directory_2/all_926R_methods_checked_seqs.glib -k 2
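The -k rule described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than getReals.pl's implementation: it assumes each sample file contributes at most one nomination per candidate, and the name `select_reals` is hypothetical.

```python
from collections import Counter


def select_reals(nominations_per_sample, k):
    """Keep candidates nominated by at least k samples.

    nominations_per_sample: one iterable of candidate IDs per sample file.
    Each sample counts at most once per candidate (an assumption here).
    """
    counts = Counter()
    for candidates in nominations_per_sample:
        counts.update(set(candidates))
    return sorted(seq for seq, n in counts.items() if n >= k)
```

Under this reading, a single sample file can never nominate anything twice, so `select_reals([["lib_1", "lib_2"]], 2)` returns an empty list. That would be consistent with an empty .reals file when only one file is supplied together with -k 2.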

You can take a look at the example in the latest version of the tcsh scripts. Sorry for the confusing warnings! Thanks so much for the bug report.

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Mar 30, 2015, at 12:33 AM, aparada14 notifications@github.com wrote:

Hi, Yes you are right, I get 50 errors of the same kind, then it says:

WARNING50: NOTE: Further warnings of this type will be suppressed.

WARNING50: Set --error-type-limit to 0 to turn off error suppression

Done. EXIT STATUS: [ERRORS: 0 WARNINGS: 104699 TIME: 17s] Scroll up to inspect full errors/warnings in-place.

I have attached the out directory I was using. I didn't include the original sequence file since it's large, but let me know if you require it. Thanks again for your help.

cff_out_directory.tar.gz https://docs.google.com/file/d/0B5-H_vZle315cV9iZTdmdW8tUUk/edit?usp=drive_web

aparada14 commented 9 years ago

I don't see any of your results. Did you attach them?

aparada14 commented 9 years ago

I reran the analysis using the 307 length you mentioned, as I had used a larger length before, and this appears to have fixed the issue with getting the right number of clones (at least in what appears to be the even samples), but the names of the samples are still wrong, so I can't really compare the staggered communities.

The length change does make me think of a separate issue: would CFF work well if I were analyzing the ITS instead of the V4 region, given that length heterogeneity is more meaningful there than with the 16S?

hepcat72 commented 9 years ago

Zipped the results. Hopefully you'll get the attachment this time...

hepcat72 commented 9 years ago

Hi Alma,

Do let me know if your results match the ones I sent you this morning in the previously attached zip file. If you did not receive the attachment in the second email, github might be stripping them out, so here's a link to the files in my dropbox:

https://dl.dropboxusercontent.com/u/87939936/WORK/4_reals_table.zip

If your sample names are incorrect, I suspect a file order consistency issue between the calls to the various scripts, particularly at the getReals.pl step, which takes multiple files of 2 kinds. The order of the supplied files must be the same for both kinds so that, when they are processed in tandem, each sample file of one type is associated with the correct sample file of the other type. To provide flexibility, file names are not compared to match them up, but if this turns out to be the issue, I may implement a warning when file names do not match.

Further, since you were using the deprecated -d flag before, you might be following the command examples in the older tcsh script (which used shell expansion for one type of input file and not for the other - which is disallowed in the latest version of getReals). Thus, you might also possibly have an old version of getReals.pl hanging around which was more susceptible to the file order issue (i.e. it was easier to encounter the file order issue on different operating systems because of inconsistencies in shell expansions between the command line and the way perl's bash glob does it).
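As a rough illustration of the file order issue, the following Python sketch pairs two file lists position by position and reports pairs whose sample names disagree. It is not part of CFF: the helper names are hypothetical, and treating everything before the first dot of the file name as the sample name is an assumption made only for this example.

```python
import os


def sample_stem(path):
    """Hypothetical sample name: the file name up to its first dot."""
    return os.path.basename(path).split(".")[0]


def pair_in_order(cands_files, n0_files):
    """Pair the two file lists in tandem and list pairs whose stems differ."""
    if len(cands_files) != len(n0_files):
        raise ValueError("both file lists must have the same length")
    pairs = list(zip(cands_files, n0_files))
    mismatched = [(c, n) for c, n in pairs if sample_stem(c) != sample_stem(n)]
    return pairs, mismatched
```

Sorting both lists before pairing, for example `pair_in_order(sorted(cands), sorted(n0s))`, is one way to get a consistent order regardless of how the shell or a glob call happened to expand the patterns.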

So let me ask a few follow-up questions:

  1. Are you calling getReals.pl from inside a shell script? If you're using a copy of one of the provided shell scripts, was there a version number at the top, and if so, what was it? The latest one is version 1.3.
  2. What do you get when you run getReals.pl with the --version flag? Call it the way you're calling it in your pipeline, e.g. with an absolute path, a relative path, or just the script name. If you're running it inside a script, just add --version and either comment out the other calls or exit just after the getReals.pl step.
  3. Do you have any user defaults set for getReals.pl? You can determine this by running getReals.pl without any options. The last line indicates any defaults, as is the case with my installation: "Current user defaults: [-y /usr/local/bin/usearch7.0.1090_i86osx32]."
  4. Could you give me the exact command you're using in the getReals.pl call?
  5. And incidentally, what is your shell environment? I.e., what do you get when you run echo $SHELL?

Thanks, Rob


hepcat72 commented 9 years ago

Oh yeah, and to answer your questions, CFF only works on sequences that are all the same length. I've been meaning to add a feature that detects the smallest length among all the input files (in mergeSeqs.pl) but currently, the 'auto' function for trim size chooses the smallest sequence length independently in each file, which is problematic for the rest of the pipeline. If you were using the auto trim size feature, that is why you were having the "number of clones" problem. The README appears to not make this clear, relying on people to use the provided pipeline scripts which require a trim size. When I had discussed a default trim size feature for mergeSeqs.pl with our team, the consensus had been to not provide one, however it didn't end up being highlighted in the documentation that a global trim size is required. I will address this issue shortly in the code.
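As a rough sketch of what a single global trim size means (this is not how mergeSeqs.pl is implemented, and the function names are hypothetical), the following Python finds the shortest sequence across all input FASTA files rather than per file:

```python
def fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA-format lines."""
    length = None  # None until the first defline is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)  # sequences may wrap over several lines
    if length is not None:
        yield length


def global_min_length(fasta_paths):
    """Return the shortest sequence length over all files, or None if there are no sequences."""
    shortest = None
    for path in fasta_paths:
        with open(path) as handle:
            for n in fasta_lengths(handle):
                if shortest is None or n < shortest:
                    shortest = n
    return shortest
```

The resulting value would then be supplied as one explicit trim size for every file, instead of relying on the per-file 'auto' behaviour described above.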

Regarding ITS sequences, the only requirement as to sequence type is that the sequences must come out of the sequencer as if they were already aligned (but no gap characters). Basically, it means that they are the result of PCR, meaning the reads all start from the end of a primer (no reverse complements or reads from the primer's mate).

Rob


aparada14 commented 9 years ago

Hi, your results look like mine. The numbers in the tail are a little different, but that's all; otherwise the mix-up with the sample names is the same.

To answer your questions 1) I am running the python scripts individually

2) getReals.pl --version

getReals.pl Version 1.13

Created: 5/19/2014

Last modified: Fri Mar 27 14:46:45 2015

3) I don't see any line like yours for the defaults. The command requires that I use the -i, -n, and -f options, so I ran it without only -k and didn't get a line like yours. Maybe I'm not doing it right?

4) getReals.pl -i 'out_directory_5/*.lib.n0s.cands' -n 'out_directory_5/*.lib.n0s' -f out_directory_5/merged_mock.glib -k 2

5) /bin/bash


hepcat72 commented 9 years ago

Thanks! This helps. Thanks for bearing with me. I appreciate the feedback so that I can improve CFF.

Your results look like mine, the numbers in the tail are a little different but that's all, otherwise the mix up with the sample names is the same.

Yes, indeed, you are correct. I see the problem: the column headers are incorrect. The "in.silico" column headers at the end should be at the beginning. I will issue a fix for this today, but I believe you can move forward as long as you know that the column headers are off. The headers do not affect the results or the downstream steps; they are just there as a courtesy. I verified it this way:

I grepped for the lib_32 sequence. The counts in each file, computed by the grep, match up with those in the summary file other than the aforementioned header mis-ordering.

grep -c -i GACCAGCACCTCAAGTGGTCAGGATGATTATTGGGCCTAAAGCATCCGTAGCCGGCTCTGTAAGTTTTCGGTTAAATCTGTACGCTCAACGTACAGGCTGCCGGGAATACTGCAAAGCTAGGGAGTGGGAGAAGTAGACGGTACTCGGTAGGAAGTGGTAAAATGCTTTGATCTATCGATGACCACCTGTGGCGAAGGCGGTCTACTAGAACGCGTCCGACGGTGAGGGATGAAAGCTGGGGGAGCAAACCGGATTAGATACCCGGGTAGTCCCAGCTGTAAACTATGCAAACTCAGTGATGCATTG /Users/rleach/Downloads/Web/mock_communities/*.fasta

M.Even.A.C.926R.fasta:3
M.Even.F.Y.926R.fasta:2
M.Even.G.C.926R.fasta:13
M.Even.H.Y.926R.fasta:26
M.Even.I.C.926R.fasta:12
M.Even.J.Y.926R.fasta:1
M.Staggered.A.C.926R.fasta:6
M.Staggered.F.Y.926R.fasta:13
M.Staggered.G.C.926R.fasta:12
M.Staggered.H.Y.926R.fasta:102
M.Staggered.I.C.926R.fasta:36
M.Staggered.J.Y.926R.fasta:6
in.silico.even.fasta:0
in.silico.stag.fasta:0

Then I grepped for the lib_137 sequence to see if it was consistent:

grep -c -i AACCAGCACCTCAAGTGGTCAGGATGATTATTGGGCCTAAAGCATCCGTAGCCGGCTCTGTAAGTTTTCGGTTAAATCTGTACGCTCAACGTACAGGCTGCCGGGAATACTGCAAAGCTAGGGAGTGGGAGAAGTAGACGGTACTCGGTAGGAAGTGGTAAAATGCTTTGATCTATCGATGACCACCTGTGGCGAAGGCGGTCTACTAGAACGCGTCCGACGGTGTGGGATGAAAGCTGGGGGAGCAAACCGGATTAGATACCCGGGTAGTCCCAGCTGTAAACTATGCAAACTCAGTGATGCATTG /Users/rleach/Downloads/Web/mock_communities/*.fasta

M.Even.A.C.926R.fasta:4
M.Even.F.Y.926R.fasta:2
M.Even.G.C.926R.fasta:3
M.Even.H.Y.926R.fasta:6
M.Even.I.C.926R.fasta:10
M.Even.J.Y.926R.fasta:4
M.Staggered.A.C.926R.fasta:7
M.Staggered.F.Y.926R.fasta:4
M.Staggered.G.C.926R.fasta:7
M.Staggered.H.Y.926R.fasta:12
M.Staggered.I.C.926R.fasta:10
M.Staggered.J.Y.926R.fasta:2
in.silico.even.fasta:0
in.silico.stag.fasta:0
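For anyone who would rather script this check than run grep by hand, a minimal Python sketch of the same per-file count follows. It assumes, like the grep calls above, that each sequence sits on a single line of its FASTA file; the function names are hypothetical.

```python
def count_matching_lines(path, sequence):
    """Count lines in `path` that contain `sequence`, ignoring case (like grep -c -i)."""
    needle = sequence.lower()
    with open(path) as handle:
        return sum(1 for line in handle if needle in line.lower())


def counts_per_file(paths, sequence):
    """Map each file path to its count of lines containing `sequence`."""
    return {path: count_matching_lines(path, sequence) for path in paths}
```

The resulting counts can be compared against that sequence's row in the summary file, in input-file order, to see which column each header should really sit above.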

So this is not the file ordering issue I had suspected. I'll let you know when the fix is published, but as I mentioned, you should be good to go as long as you keep the header offset in mind. Manually correcting the headers is purely cosmetic, so it's not necessary.
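For anyone who wants to repeat this check without grep, here is a minimal Python sketch of the same idea: count, per file, the lines that contain a given sequence, ignoring case. The function names and the tiny demo files are made up for illustration. Like `grep -c`, it counts matching lines, so it assumes each sequence sits on a single line.

```python
import tempfile
from pathlib import Path


def count_matching_lines(path, query):
    """Count lines containing `query`, ignoring case (what `grep -c -i` reports)."""
    needle = query.lower()
    with open(path) as handle:
        return sum(1 for line in handle if needle in line.lower())


def counts_by_file(directory, query, pattern="*.fasta"):
    """Return {file name: matching line count}, visiting files in sorted name order."""
    return {
        path.name: count_matching_lines(path, query)
        for path in sorted(Path(directory).glob(pattern))
    }


# Tiny made-up demo data (NOT the real mock community files).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sampleA.fasta").write_text(">r1\nACGTAC\n>r2\nacgtac\n>r3\nTTTT\n")
    Path(tmp, "sampleB.fasta").write_text(">r1\nGGGG\n")
    demo_counts = counts_by_file(tmp, "ACGTAC")

print(demo_counts)
```

The per-file counts can then be compared column by column against the row for the same sequence in the summary file.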

The column header line should go from this:

ID M.Even.A.C.926R M.Even.F.Y.926R M.Even.G.C.926R M.Even.H.Y.926R M.Even.I.C.926R M.Even.J.Y.926R M.Staggered.A.C.926R M.Staggered.F.Y.926R M.Staggered.G.C.926R M.Staggered.H.Y.926R M.Staggered.I.C.926R M.Staggered.J.Y.926R in.silico.even in.silico.stag

To this:

ID in.silico.even in.silico.stag M.Even.A.C.926R M.Even.F.Y.926R M.Even.G.C.926R M.Even.H.Y.926R M.Even.I.C.926R M.Even.J.Y.926R M.Staggered.A.C.926R M.Staggered.F.Y.926R M.Staggered.G.C.926R M.Staggered.H.Y.926R M.Staggered.I.C.926R M.Staggered.J.Y.926R
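If you would rather patch the header programmatically than by hand, the reordering can be sketched as below. This is only an illustration, not part of CFF: the function name is hypothetical, and the tab separator is an assumption about the summary file format. Only the header line is rewritten; the data rows stay as they are.

```python
def move_trailing_columns_forward(header_line, n_trailing=2, sep="\t"):
    """Move the last `n_trailing` column names to directly after the ID column."""
    fields = header_line.rstrip("\n").split(sep)
    id_col, rest = fields[0], fields[1:]
    if not 1 <= n_trailing <= len(rest):
        raise ValueError("n_trailing must be between 1 and the number of sample columns")
    return sep.join([id_col] + rest[-n_trailing:] + rest[:-n_trailing])


# Rebuild the mis-ordered header from this thread (tab-separated by assumption).
samples = [
    f"M.{kind}.{tag}.926R"
    for kind in ("Even", "Staggered")
    for tag in ("A.C", "F.Y", "G.C", "H.Y", "I.C", "J.Y")
]
broken = "\t".join(["ID"] + samples + ["in.silico.even", "in.silico.stag"])

fixed = move_trailing_columns_forward(broken)
print(fixed.replace("\t", " "))
```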

Let me know if you encounter any other problems.

Thanks again, Rob

On Wed, Apr 1, 2015 at 7:50 AM, Robert Leach notifications@github.com wrote:

Hi Alma,

Do let me know if your results match the ones I sent you this morning in the previously attached zip file. If you did not receive the attachment in the second email, github might be stripping them out, so here's a link to the files in my dropbox:

https://dl.dropboxusercontent.com/u/87939936/WORK/4_reals_table.zip

If your sample names are incorrect, I suspect that there may be a file order consistency issue between the different calls of the various scripts, particularly when it comes to the getReals.pl step, which takes multiple of 2 kinds of files. The order of the supplied files must be the same so that when they are processed in tandem, the sample files of one type are associated with the correct sample files of the other type. File names are not compared to match them up in order to provide flexibility, but if this turns out to be the issue, I may implement a warning when file names do not match.

Further, since you were using the deprecated -d flag before, you might be following the command examples in the older tcsh script (which used shell expansion for one type of input file and not for the other - which is disallowed in the latest version of getReals). Thus, you might also possibly have an old version of getReals.pl hanging around which was more susceptible to the file order issue (i.e. it was easier to encounter the file order issue on different operating systems because of inconsistencies in shell expansions between the command line and the way perl's bash glob does it).
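The kind of file-name check mentioned above could look roughly like the following Python sketch. It is not what getReals.pl does. It only shows how two file lists that are processed in tandem can be compared by sample stem. The suffixes `.lib.n0s.cands` and `.lib.n0s` come from the commands in this thread; the function names and example paths are hypothetical.

```python
import os


def sample_stem(path, suffix):
    """Strip the directory and the given suffix to get the sample's base name."""
    name = os.path.basename(path)
    return name[: -len(suffix)] if name.endswith(suffix) else name


def mismatched_pairs(cand_files, n0_files,
                     cand_suffix=".lib.n0s.cands", n0_suffix=".lib.n0s"):
    """Return the (cands, n0s) pairs whose sample stems differ when paired in order."""
    if len(cand_files) != len(n0_files):
        raise ValueError("both file lists must have the same length")
    return [
        (cand, n0)
        for cand, n0 in zip(cand_files, n0_files)
        if sample_stem(cand, cand_suffix) != sample_stem(n0, n0_suffix)
    ]


# Hypothetical file lists supplied in inconsistent orders.
cands = ["out/A.fna.lib.n0s.cands", "out/B.fna.lib.n0s.cands"]
n0s = ["out/B.fna.lib.n0s", "out/A.fna.lib.n0s"]

bad = mismatched_pairs(cands, n0s)
good = mismatched_pairs(sorted(cands), sorted(n0s))
print(len(bad), len(good))
```

Sorting both lists the same way before pairing is one simple way to keep the two file types aligned regardless of how the shell or Perl expanded the glob.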

So let me ask a few follow-up questions:

  1. Are you calling getReals.pl from inside a shell script? If you're using a copy of one of the provided shell scripts, was there a version number at the top and if so, what was it? The latest one is version 1.3.
  2. What do you get when you run getReals.pl with the --version flag? Call it the way you're calling it in your pipeline, e.g. if you're using an absolute path or relative path or just the script name - or if you're running it inside a script, just add --version and either comment out other calls or exit just after the getReals.pl step.
  3. Do you have any user defaults set for getReals.pl? You can determine this by running getReals.pl without any options. The last line indicates any defaults, as is the case with my installation: "Current user defaults: [-y /usr/local/bin/usearch7.0.1090_i86osx32]."
  4. Could you give me the exact command you're using in the getReals.pl call?
  5. And incidentally, what is your shell environment? I.e. What do you get when you run echo $SHELL?

Thanks,

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Mar 31, 2015, at 9:33 PM, aparada14 notifications@github.com wrote:

I reran the analysis using the 307 length you mentioned, as I had used a larger length before, and this appears to have fixed the issue with getting the right number of clones (at least in what appears to be the even samples), but the names of the samples are still wrong, so I can't really compare the staggered communities.

The length change does make me think of a separate issue: does CFF work well if I were analyzing the ITS instead of the V4 region, given that length heterogeneity is more meaningful there than with the 16S?

On Tue, Mar 31, 2015 at 6:10 PM, Alma Parada aparada@usc.edu wrote:

I don't see any of your results, did you attach?

On Tue, Mar 31, 2015 at 5:59 PM, Robert Leach < notifications@github.com> wrote:

Before I get too deep into it, can you confirm that this is the same result you get?

I ran with the default of the shortest length: 307.

Rob

On Mar 31, 2015, at 2:04 PM, aparada14 notifications@github.com wrote:

mock_communities.tar.gz <https://docs.google.com/file/d/0B5-H_vZle315Rmo0V0JJMnowRWM/edit?usp=drive_web>

Hi Robert, I am running into a separate problem now, and I am not sure if this is a problem with my samples or not. I am attempting to run CFF on a group of mock community samples: some are even mock communities (made by amplifying 11 16S clones that had been combined at equal molar concentrations), others are staggered mock communities (27 clones at different relative abundances), and then there are two fasta files that I made which represent each mock community by having the "perfect" clone sequences at the expected abundances. I ran these, and the first issue I see is that when I look at the merged_mock.glib.smry file, my samples are not named correctly anymore, i.e. my expected even sample looks like an amplified staggered community. Then, if I look at the counts per "otu", I am getting back 7 even clones instead of the 11 I put in. Any thoughts? I am including the sequence files I am running the analysis on if you'd like to try it out; maybe I'm doing something fundamentally wrong. I am not sure which of my output files may help you, so let me know and I can send it to you. Thanks in advance again, alma

On Mon, Mar 30, 2015 at 2:39 PM, Robert Leach <notifications@github.com> wrote:

No problem. Let me know if you have any more trouble.

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Mar 30, 2015, at 5:23 PM, aparada14 notifications@github.com wrote:

Great, I will try it now, thanks for the quick help!

On Mon, Mar 30, 2015 at 8:27 AM, Robert Leach < notifications@github.com> wrote:

Hi,

Alright, I see the problem. First, the warning is reporting the wrong file name. The file with the missing N0 values is out_directory_2/all_926R_methods_checked_seqs.fna.lib, supplied with -d. A new feature of getReals.pl, added recently by request, is the ability to produce a .n0smry file (similar to the .smry file, but with N0 values instead of counts). I had kept the -d option (renamed: -n or --n0-files) for backward compatibility, but apparently missed the fact that this would generate warnings (and imprecise ones, at that) with older pipelines. I will implement a fix today, but note that your output, except for the .n0smry file, is technically correct.

That said, I should be catching a use case I did not anticipate, and I should definitely throw an error and improve the usage output to make this clearer. You indicated -k 2; however, this requires at least 2 files supplied to -i and 2 files supplied to -d. Think of these files as sample files. Each one can nominate a set of candidates. -k 2 means that a candidate must be nominated at least twice to be considered real. So here is how getReals should be called:

getReals.pl -i 'out_directory_2/*.lib.n0s.cands' -n 'out_directory_2/*.lib.n0s' -f out_directory_2/all_926R_methods_checked_seqs.glib -k 2
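To make the -k rule described above concrete, here is a small Python sketch of "a candidate must be nominated by at least k sample files". It is an illustration of the rule as stated in this message, not the actual getReals.pl logic. The function name and the sample data are hypothetical.

```python
from collections import Counter


def reals_from_nominations(nominations_per_sample, k=2):
    """Return candidates nominated by at least `k` samples, in sorted order."""
    votes = Counter()
    for candidates in nominations_per_sample:
        # Each sample file contributes at most one nomination per candidate.
        votes.update(set(candidates))
    return sorted(seq for seq, count in votes.items() if count >= k)


# Hypothetical nominations from three sample files.
three_samples = [["lib_1", "lib_2"], ["lib_1", "lib_3"], ["lib_2", "lib_1"]]
print(reals_from_nominations(three_samples, k=2))

# With a single sample file, no candidate can reach two nominations.
one_sample = [["lib_1", "lib_2"]]
print(reals_from_nominations(one_sample, k=2))
```

Under this reading, supplying only one file with -k 2 would leave nothing that qualifies as real, which would be consistent with an empty .reals file.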

You can take a look at the example in the latest version of the tcsh scripts. Sorry for the confusing warnings! Thanks so much for the bug report.

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Mar 30, 2015, at 12:33 AM, aparada14 <notifications@github.com> wrote:

Hi, Yes you are right I get 50 errors of the same kind, then it says:

WARNING50: NOTE: Further warnings of this type will be suppressed.
WARNING50: Set --error-type-limit to 0 to turn off error suppression

Done. EXIT STATUS: [ERRORS: 0 WARNINGS: 104699 TIME: 17s] Scroll up to inspect full errors/warnings in-place.

I have attached the out directory I was using. I didn't include the original sequence file, since it's large, but let me know if you require it. Thanks again for your help.

cff_out_directory.tar.gz <https://docs.google.com/file/d/0B5-H_vZle315cV9iZTdmdW8tUUk/edit?usp=drive_web>

On Sun, Mar 29, 2015 at 8:51 PM, Robert Leach < notifications@github.com> wrote:

Hi,

I'd be happy to help. The pattern looks good for the file on which you called head. If my code for the warning is accurate, the file reported in the warning appears to be missing N0 values (out_directory_2/test.fna.lib.n0s.cands), at least on that one defline (>lib_45;size=585;). Since it is warning number 45, and it is sequence lib_45, I'm guessing the previous 44 deflines are also missing the N0 values, but the file you pasted does have N0 values.
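As a quick illustration of why one defline parses and the other triggers the warning, here is the pattern from the warning message, N0=([^;]+);, applied with Python's re module. The pattern syntax is the same in Perl and Python for this expression. The helper name is made up for this sketch.

```python
import re

# Pattern quoted in the warning: capture everything after "N0=" up to the next ";".
N0_PATTERN = re.compile(r"N0=([^;]+);")


def parse_n0(defline):
    """Return the N0 value as a string, or None when the defline has no N0 field."""
    match = N0_PATTERN.search(defline)
    return match.group(1) if match else None


with_n0 = parse_n0(">lib_1;size=21938;N0=0;")   # defline style from the head output
without_n0 = parse_n0(">lib_45;size=585;")      # defline quoted in the warning
print(with_n0, without_n0)
```

A defline without an N0=...; field simply does not match, which is the situation the warning reports.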

If you'd like to send me the files, I can run them to see if I can reproduce your issue.

Cheers, Rob

Sent from my iPad

On Mar 29, 2015, at 11:04 PM, aparada14 < notifications@github.com> wrote:

Hello, I am running CFF on a fasta file, and going through each of the scripts. When trying to run getReals.pl, I am running into the warning below, which I think is causing my all_926R_methods_checked_seqs.glib.reals file to be empty. I am not familiar with perl regular expressions, so help in fixing this warning would be greatly appreciated thank you!

WARNING45: Unable to parse N0 from defline: [>lib_45;size=585;] in file: [out_directory_2/test.fna.lib.n0s.cands] using pattern: [N0=([^;]+);]. Please either fix the defline or use a different pattern (-l) to extract the N0 value.

I am using this command, getReals.pl -i out_directory_2/all_926R_methods_checked_seqs.fna.lib.n0s.cands -d 'out_directory_2/all_926R_methods_checked_seqs.fna.lib' -f out_directory_2/all_926R_methods_checked_seqs.glib -k 2

and the result of head -n 4 out_directory_2/all_926R_methods_checked_seqs.fna.lib.n0s.cands gives

lib_1;size=21938;N0=0;

TACGAAGGGACCTAGCGTAGTTCGGAATTACTGGGCTTAAAGAGCTCGTAGGTGGTTAAAAAAGTTGATGGTGAAATCCCAAGGCTCAACCTTGGAACTGCCATCAAAACTTTTTAGCTAGAGTGTGATAGAGGTAAGTGGAATTTCTAGTGTAGAGGTGAAATTCGTAGATATTAGAAAGAACACCAAATGCGAAGGCAACTTACTGGGTCACTACTGACACTGAGGAGCGAAAGCATGGGTAGCGAAGAGGATTAGATACCCTCGTAGTCCATGCCGTAAACGATGTGTGCTAGACGTTGGAAATATATTTTTCAGTGTCGCAGCGAAAGCATTAAGCACACCGCCTGGGGAGTACGACCGCAAGGTTA

lib_2;size=8049;N0=0;

TACGAAGGGGGCGAGCGTTATTCGGAATTATTGGGCGTAAAGGGCTCGCAGGCTGCTTGAACAGTTAGACGTGAAATCCCCGGGCTCAACCTGGGAACTGCGTTTAATACTAGCAAGCTAGAGAAATAGAGAGGAAAGTGGAACTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCAGTGGCGAAAGCGACTTTCTGGCTATTTTCTGACGCTGAGGAGCGAAAGCGTGGGGAGCAAACAGGGTTAGATACCCTGGTAGTCCACGCCGTAAACGATGTGTGCTAGATGTTGGAAGGTTACCTTTCAGTGTCGCAGCTAACGCACTAAGCACACCGCCTGGGAAGTACGGTCGCAAGATTA

— Reply to this email directly or view it on GitHub.


hepcat72 commented 9 years ago

The column header order bug has been fixed. To update your installation, just download the newest version on github and run these commands:

perl Makefile.PL
make
sudo make install

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Apr 1, 2015, at 1:07 PM, aparada14 notifications@github.com wrote:

Hi, your results look like mine. The numbers in the tail are a little different, but that's all; otherwise the mix-up with the sample names is the same.

To answer your questions 1) I am running the python scripts individually

2) getReals.pl --version

getReals.pl Version 1.13

Created: 5/19/2014

Last modified: Fri Mar 27 14:46:45 2015

3) I don't see any line like yours for the defaults, the command requires I use the -i, -n, and -f commands, so I ran without the -k only and didn't get a line like yours, maybe i'm not doing it right?

4) getReals.pl -i 'out_directory_5/*.lib.n0s.cands' -n 'out_directory_5/*.lib.n0s' -f out_directory_5/merged_mock.glib -k 2

5)/bin/bash
