hepcat72 / CFF

Cluster-free Filtering. Determine which sequences are real in a metagenomic sample.
GNU General Public License v3.0

ERROR: Command getReals.pl failed #4

Closed kojiyasuda closed 9 years ago

kojiyasuda commented 9 years ago

It looks like my run was terminated at the "getReals" step, and I am not sure how to fix this. Does this error make sense to you? I placed my input fna file, containing 16S V4 Illumina stitched 250 nt amplicons, into the Caporaso_FASTA folder (after removing the original examples), since example1 ran nicely and I didn't know how to run a new command. Hence the output says Caporaso_FASTA, but the folder actually contained my fna file, which I renamed to match one of the example fna files. Thanks for looking into this! Koji


/Users/kojiyasuda/Desktop/CFF-master/samples/run_example1.tcsh

RUNNING run_CFF_on_FastA.tcsh

Start time: Mon Jun 15 18:57:44 EDT 2015
Trim length: 130
Z-score threshold: 2
Magnitude over N0 Threshold: 10
Nominations threshold: 2
OUTPUT DIRECTORY: Caporaso_FASTA_out
mergeSeqs.pl 'Caporaso_FASTA/L6S2?_19???.fna' -f 'global_library.fna' --outdir 'Caporaso_FASTA_out/2_lib' -o .lib -b 130 -p '' -- 1989 seconds
neighbors.pl 'Caporaso_FASTA_out/1_lib/global_library.fna' -o .nbrs -- 995 seconds
errorRates.pl 'Caporaso_FASTA_out/1_lib/global_library.fna' -n 'Caporaso_FASTA_out/1_lib/global_library.fna.nbrs' -z 2 -o .erates -- 262 seconds
nZeros.pl 'Caporaso_FASTA_out/1_lib/{L6S2?_19???.fna}.lib' -n 'Caporaso_FASTA_out/1_lib/global_library.fna.nbrs' -r 'Caporaso_FASTA_out/1_lib/global_library.fna.erates' -o .n0s --outdir 'Caporaso_FASTA_out/2_n0s' -- 348 seconds
getCandidates.pl 'Caporaso_FASTA_out/2_n0s/{L6S2?_19???.fna}.lib.n0s' -o .cands -h 10 --outdir 'Caporaso_FASTA_out/3_cands' -- 256 seconds
getReals.pl -i 'Caporaso_FASTA_out/3_cands/{L6S2?_19???.fna}.lib.n0s.cands' -n 'Caporaso_FASTA_out/2_n0s/{L6S2?_19???.fna}.lib.n0s' -f 'Caporaso_FASTA_out/1_lib/global_library.fna' -k 2 --outdir 'Caporaso_FASTA_out/4_reals_table'
ERROR1: Too few candidates files (-i) supplied. The number of minimum candidacies (-k): [2] requires at least as many files supplied to each of the -i and -n (backwards-compatible with -d) options. I.e. -i requires at least [2] files and -n requires at least [2] files. If you only have 1 of each file, then this script should not be applied unless you set -k to 1, which will allow you to filter for chimeras at least.

hepcat72 commented 9 years ago

Hi Koji,

Thanks for using CFF. Let me explain how getReals.pl works, but first, let me back up to the previous step....

getCandidates.pl selects sequences it thinks are real by comparing each sequence's abundance with the abundance you would expect if it were an erroneous copy of a neighboring sequence (one that differs from it by only 1 base). It uses the estimated error rate of each nucleotide change to predict how likely it is that its neighbors were misread during PCR to produce it. If the sequence's actual abundance exceeds the predicted erroneous abundance by a given threshold, it is assumed to be real.
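The comparison described above can be sketched roughly as follows. This is a minimal Python illustration, not the code in getCandidates.pl or nZeros.pl: the function names, the `error_rates` dictionary keyed by (neighbor base, observed base), and the exact way the threshold is applied are all assumptions made for the example. The default of 10 only mirrors the "Magnitude over N0 Threshold: 10" line in the log.

```python
def expected_error_abundance(seq, abundances, error_rates):
    """Estimate how many reads of `seq` could be explained as errors.

    Assumption: every neighbor that differs from `seq` by one substitution
    contributes (neighbor abundance) x (rate of misreading the neighbor's
    base as the base found in `seq`).
    """
    total = 0.0
    for i, base in enumerate(seq):
        for other in "ACGT":
            if other == base:
                continue
            neighbor = seq[:i] + other + seq[i + 1:]
            rate = error_rates.get((other, base), 0.0)
            total += abundances.get(neighbor, 0) * rate
    return total


def is_candidate(seq, abundances, error_rates, magnitude=10.0):
    """Call `seq` a candidate if its abundance exceeds `magnitude` times
    the abundance expected from errors alone (hypothetical rule)."""
    n0 = expected_error_abundance(seq, abundances, error_rates)
    return abundances[seq] > magnitude * n0
```

With a uniform error rate of 0.001 and abundances of 1000, 200 and 5 for three mutually neighboring sequences, the sequence seen 5 times is explainable as error (about 1.2 reads expected, so the cutoff is about 12), while the one seen 200 times is not.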

Now getReals.pl takes it a step further. It looks across multiple samples (e.g. a time series) and pools the candidates from the different sample files. If it sees the same candidate sequence in at least a threshold number of samples, it assumes the sequence is real. It also filters out chimeras.

So, running getReals.pl requires multiple sample files (at least 2) in order to run. Essentially, your results with a single sample file are the candidate files.
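To make the cross-sample rule concrete, here is a small Python sketch of the counting step only. It is an illustration under assumptions, not getReals.pl itself: `select_reals` and `min_candidacies` are hypothetical names (the latter standing in for `-k`), and chimera filtering is left out entirely.

```python
from collections import Counter


def select_reals(candidates_per_sample, min_candidacies=2):
    """Return sequences that are candidates in at least `min_candidacies` samples.

    `candidates_per_sample` holds one collection of candidate sequences
    per sample file.
    """
    if min_candidacies > len(candidates_per_sample):
        # Same situation as ERROR1 in the log: fewer sample files than -k.
        raise ValueError(
            f"need at least {min_candidacies} samples, "
            f"got {len(candidates_per_sample)}"
        )
    counts = Counter()
    for candidates in candidates_per_sample:
        # Count each sequence at most once per sample.
        counts.update(set(candidates))
    return {seq for seq, n in counts.items() if n >= min_candidacies}
```

With a single sample and a threshold of 2, no sequence can ever reach the threshold, which is why the script refuses to run in that case.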

If you wish to perform chimera filtering on a single sample file, currently, you would have to run uchime on your own. getReals.pl runs uchime in 2 separate calls:

usearch -usearch_global $tmp_lib -db $gcand_file -id 1.0 -idprefix $seqlen -matched $cands_with_global_abund -strand plus -quiet

usearch -uchime_denovo $cands_with_global_abund -minuniquesize 2 -nonchimeras $tmp_out_file $aln_arg -quiet

where:

$tmp_lib = a temporary library file output in the first step.
$gcand_file = your candidates file.
$seqlen = the length of your sequences.
$cands_with_global_abund = a temporary candidate file output in the first step.
$tmp_out_file = your chimera output file.
$aln_arg = your chimera alignment output file.

Good luck, Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

kojiyasuda commented 9 years ago

Hi Rob,

Thank you so much for explaining the steps. It looks like we will have to split the file by sample, since the one input file we used contains all our sequences. We will try splitting it into one file per sample (in fna format) and run it again. Or do you have a flag that tells the script to distinguish between samples (e.g. "54" versus "44"; see below) within one file?

Thank you again for your kind help. Koji

>54_0 M01032:110:000000000-A5D3N:1:1101:14044:1587 1:N:0:0 orig_bc=GAGGCTCATCAT new_bc=GAGGCTCATCAT bc_diffs=0
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTAATAAGTCTGAAGTTAAAGGCAGTGGCTTAACCATTGTTCGCTTTGGAAACTGTTAGACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGT
AGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCTGTAGTCCGGC
>44_1 M01032:110:000000000-A5D3N:1:1101:16826:1587 1:N:0:0 orig_bc=AAGGAGCGCCTT new_bc=AAGGAGCGCCTT bc_diffs=0
TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCTGACAAGTCAGCGGTGAAATGTCCACGCTCAACGTGGAAAGTGCCGTTGAAACTGCCGGACTAGAATTCGGATGCCGTGGGAGGAATGTGTAGTGTAGCGGTGAAATGCT
TAGATATTACACAGAACACCGATTGCGAAGGCATCTCACGAATCCGACATTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAGGATTAGATACCCCTGTAGTCCGG
>57_2 M01032:110:000000000-A5D3N:1:1101:16915:1587 1:N:0:0 orig_bc=AGTAGAGGGATG new_bc=AGTAGAGGGATG bc_diffs=0
TACGGAAGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCGGTCAAATGTCAGGGCCCAACCTTGGCCTGCCGTTGATACTGGTAGTCTTGAATACACACAAGGAAGATGGAATTCGTCGTGTAGCGGTGAGATGCTT
AGATATGACGAAGAACTCCGATTGCGAAGGCAGTCTTCTGGGGTGCGATTGACGCTGAGGCTCGAAAGTGCGGGAATCAAACAGGATTAGAAACCCCAGTAGTCCGGC
>14_3 M01032:110:000000000-A5D3N:1:1101:15513:1588 1:N:0:0 orig_bc=CCTCGTTCGACT new_bc=CCTCGTTCGACT bc_diffs=0
TTAGATACCCTAGTAGTCCGGCTGACTGACTCCTCGTTCGACTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACCGGATCCTAACTCCGGAACTGCCGATGATACAGATGTGCTGGAATACAGATGCCGTGGGATCAATTAGTAGTGTATCGGTGAAACACA
TAGATATTACTCAGAACACCGATTGCGAAGTCATCTCACGAAGCAGGTATTGACGCTGATGCACGAAAACGTGGGGATCAAACAACACTAGAAACCCCAGAATGCCGG
>20_4 M01032:110:000000000-A5D3N:1:1101:15485:1588 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCGCTTTAAGTCAGTGGTCAAATCGTGAGGCTCAACCTCATCCCGCCATTGATACTGGAGCGCTTGATTGCGGTTGAGGTAGGCGGAATTCGTCGTGTAGCGGTGAAATGCAT
AGATATGACGAAGAACCCCGATTGCGTAGGCAGCTTACCAGACCGACAATGACGCTCATGCACGAAAGTGCGGGGATCGAAAAGGATTAGAAACCCCAGTAGTCCGGC
>50_5 M01032:110:000000000-A5D3N:1:1101:16649:1588 1:N:0:0 orig_bc=TAGGAACTGGCC new_bc=TAGGAACTGGCC bc_diffs=0
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGAAGGTCAAGTCAGCTGTGGAATGTAGTCGCTCAACGTCTGCACTGCAGTTGAAACTGGCCTCCTTGAGTGCGTAAGAGGCAGGCGGAATTCGTCGTGTAGCGGTGAAATGCT
TAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGGCCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCAAACAGGATTAGATACCCGGGTAGTCCGG
>83_6 M01032:110:000000000-A5D3N:1:1101:17160:1588 1:N:0:0 orig_bc=ACATTCAGCGCA new_bc=ACATTCAGCGCA bc_diffs=0
TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGATGAGTAAGTCAGCGGTGAAATACCCCAGCTCAACTGGGGGGCTGCCGTTGATACTGCTTATCTAGAGTGCGAACGGCGCCGGCGGAATGTGTCATGTAGCGGTGAAATGCTTAGAGATGACACAGAAACCCGATCGCGAAGGCAGCCGGCGAGCACGACACTGACGCTGAGGCACGAAGGTGCGGGGATCAAACAGGATTAGATACCCGTGAAGTCCGG
>64_7 M01032:110:000000000-A5D3N:1:1101:16520:1589 1:N:0:0 orig_bc=GAATCTTCGAGC new_bc=GAATCTTCGAGC bc_diffs=0
TACGTAGGTTGCAAGCGTTGTCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGGCCTTTAAGTCAGTGGTCAAAGCGTGTGGCTCAACCCTACCACGCCGTTGATACTGGAGGCCTTGAGTGCACATAAGGATGGTGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTGTCCGGGGCGTAACTGACGCTAATGCTCGAAAGTGCGGGTATCAAACAGGATTAGATACCCCAGTAGTCCGGC
>20_8 M01032:110:000000000-A5D3N:1:1101:15950:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0
TACGGAAGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCATGCTAAGTCTGCCGTCAAATGGCAGGGCTCAACCCTGTCTTGCGGTGGAAACTGATGGGCTTGAGTACACTCGAGGCAAGTGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTGCCTGGTGTGCGATTGACGCTGAGGCTCGAAGGTGCGGGAATCAAACAGGATTAGATACCCGAGTAGTCCGGC
>40_9 M01032:110:000000000-A5D3N:1:1101:15928:1589 1:N:0:0 orig_bc=AAGAGATGTCGA new_bc=AAGAGATGTCGA bc_diffs=0
TACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGCCTGGCAAGCCAGAGGTGAAAACCCGGGGCTCAACCCCGTGATTGCCTTTGGAACTGTTAGGCTTGAGTACTGGAGGGGCAGGCGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCCTGCTGGACAGAAACTGACGCTGGGGCTCGAAAGCGTGGGGGGCAAACAGGATTAGTTACCCCGGTAGCCGGG
>20_10 M01032:110:000000000-A5D3N:1:1101:14491:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCTTTTTAAGTCAGTGGTTAAATCGTGACGCTCAACGTCATCACGCCATTGATACTGGAGAGCTTGATTGCGGTCGAGGTTTGCGGAATTCGTTGTGTAGCGGTGAAATGCATAGATATGACGAAGAACACCGATTGCGTAGGCAGCAGACCAGGCCGTAAATGACGCTCATGCACGAAAGTGCGGGGATCGAACAGGATTAGATACCCGGGTAGTCCGGC

hepcat72 commented 9 years ago

This Perl one-liner should split your file for you:

perl -e 'open(IN,$ARGV[0]);while(<IN>){if(/^>(\d+)_/){if(defined($of)){close(OUT)}$of="$ARGV[0].$1";open(OUT,">>$of");select(OUT);}print}close(IN);close(OUT);' your_file

Replace "your_file" with the path/name of your file. Output files will end in ".##" where ## is your sample number.

You must copy & paste this onto the command line as-is. If your email program has interrupted the single line with hard returns, you must repair it.
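For anyone who would rather not paste a one-liner, the same splitting logic can be sketched in Python. This is an illustrative equivalent, not part of CFF: the function name `split_fasta_by_sample` is made up for the example, and it assumes deflines of the form `>54_0 ...` where the digits before the underscore are the sample number.

```python
import re
from pathlib import Path

# A defline such as ">54_0 M01032:..." starts a record of sample 54.
HEADER = re.compile(r"^>(\d+)_")


def split_fasta_by_sample(path):
    """Write each record of `path` to "<path>.<sample>" and return the
    sorted list of sample numbers (as strings) that were found."""
    path = Path(path)
    handles = {}    # sample number -> open output file
    current = None  # file that receives the lines of the current record
    try:
        with path.open() as fin:
            for line in fin:
                match = HEADER.match(line)
                if match:
                    sample = match.group(1)
                    if sample not in handles:
                        handles[sample] = open(f"{path}.{sample}", "w")
                    current = handles[sample]
                if current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)

# Usage (hypothetical file name): split_fasta_by_sample("your_file.fna")
```

One deliberate difference: the one-liner opens its outputs in append mode (`>>`), so running it twice doubles the contents unless the old output files are deleted first, whereas this sketch overwrites them. It also keeps one file handle open per sample, which is fine for tens of samples but could hit the open-file limit for very many.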

Rob

kojiyasuda commented 9 years ago

Wow, thank you very much. The script worked like magic. The sample number was placed after "fna" in the output file names (example below), which we probably need to fix before running CFF. One thing I noticed is that this file contained only half of the data. I am going to look for the other half and then run CFF.

L6S21_19130.fna.44

Thank you very much again for your prompt help. I will keep you posted if the run goes nicely! Koji
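One possible way to handle the file naming mentioned above is to move the sample number in front of the extension. The helper below is a hypothetical Python sketch: the name `move_sample_before_extension` and the target pattern `name.##.fna` are assumptions, not a convention that CFF documents.

```python
import re


def move_sample_before_extension(name, ext=".fna"):
    """Turn "L6S21_19130.fna.44" into "L6S21_19130.44.fna".

    Names that do not end in "<ext>.<digits>" are returned unchanged.
    """
    match = re.fullmatch(rf"(.+){re.escape(ext)}\.(\d+)", name)
    if match is None:
        return name
    stem, sample = match.groups()
    return f"{stem}.{sample}{ext}"
```

Note that the example script in the log selects inputs with the pattern 'L6S2?_19???.fna', which would not match the renamed files, so the pattern in the run script would need adjusting as well.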

On Jun 17, 2015, at 3:57 PM, Robert Leach notifications@github.com<mailto:notifications@github.com> wrote:

This perl 1-liner should split your file for you:

perl -e 'open(IN,$ARGV[0]);while(){if(/^>(\d+)_/){if(defined($of)){close(OUT)}$of="$ARGV[0].$1";open(OUT,">>$of");select(OUT);}print}close(IN);close(OUT);' your_file

Replace "your_file" with the path/name of your file. Ourput files will end in ".##" where ## is your sample number.

You must copy & paste this onto the command line as-is. If your email program has interrupted the single line with hard returns, you must repair it.

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Jun 17, 2015, at 3:43 PM, kojiyasuda notifications@github.com<mailto:notifications@github.com> wrote:

Hi Rob,

Thank you so much for explaining the steps. It looks like we would have to split the files by samples, since the one input file used contain all our sequences. We will try splitting the files by samples (in fna format) and will try it again. Or do you have a flag to tell the script to calculate between samples (i.e “54” from “44” see below) from one file?

Thank you again for your kind help. Koji

54_0 M01032:110:000000000-A5D3N:1:1101:14044:1587 1:N:0:0 orig_bc=GAGGCTCATCAT new_bc=GAGGCTCATCAT bc_diffs=0 TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTAATAAGTCTGAAGTTAAAGGCAGTGGCTTAACCATTGTTCGCTTTGGAAACTGTTAGACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGT AGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCTGTAGTCCGGC 44_1 M01032:110:000000000-A5D3N:1:1101:16826:1587 1:N:0:0 orig_bc=AAGGAGCGCCTT new_bc=AAGGAGCGCCTT bc_diffs=0 TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCTGACAAGTCAGCGGTGAAATGTCCACGCTCAACGTGGAAAGTGCCGTTGAAACTGCCGGACTAGAATTCGGATGCCGTGGGAGGAATGTGTAGTGTAGCGGTGAAATGCT TAGATATTACACAGAACACCGATTGCGAAGGCATCTCACGAATCCGACATTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAGGATTAGATACCCCTGTAGTCCGG 57_2 M01032:110:000000000-A5D3N:1:1101:16915:1587 1:N:0:0 orig_bc=AGTAGAGGGATG new_bc=AGTAGAGGGATG bc_diffs=0 TACGGAAGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCGGTCAAATGTCAGGGCCCAACCTTGGCCTGCCGTTGATACTGGTAGTCTTGAATACACACAAGGAAGATGGAATTCGTCGTGTAGCGGTGAGATGCTT AGATATGACGAAGAACTCCGATTGCGAAGGCAGTCTTCTGGGGTGCGATTGACGCTGAGGCTCGAAAGTGCGGGAATCAAACAGGATTAGAAACCCCAGTAGTCCGGC 14_3 M01032:110:000000000-A5D3N:1:1101:15513:1588 1:N:0:0 orig_bc=CCTCGTTCGACT new_bc=CCTCGTTCGACT bc_diffs=0 TTAGATACCCTAGTAGTCCGGCTGACTGACTCCTCGTTCGACTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACCGGATCCTAACTCCGGAACTGCCGATGATACAGATGTGCTGGAATACAGATGCCGTGGGATCAATTAGTAGTGTATCGGTGAAACACA TAGATATTACTCAGAACACCGATTGCGAAGTCATCTCACGAAGCAGGTATTGACGCTGATGCACGAAAACGTGGGGATCAAACAACACTAGAAACCCCAGAATGCCGG 20_4 M01032:110:000000000-A5D3N:1:1101:15485:1588 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCGCTTTAAGTCAGTGGTCAAATCGTGAGGCTCAACCTCATCCCGCCATTGATACTGGAGCGCTTGATTGCGGTTGAGGTAGGCGGAATTCGTCGTGTAGCGGTGAAATGCAT AGATATGACGAAGAACCCCGATTGCGTAGGCAGCTTACCAGACCGACAATGACGCTCATGCACGAAAGTGCGGGGATCGAAAAGGATTAGAAACCCCAGTAGTCCGGC 50_5 M01032:110:000000000-A5D3N:1:1101:16649:1588 1:N:0:0 
orig_bc=TAGGAACTGGCC new_bc=TAGGAACTGGCC bc_diffs=0 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGAAGGTCAAGTCAGCTGTGGAATGTAGTCGCTCAACGTCTGCACTGCAGTTGAAACTGGCCTCCTTGAGTGCGTAAGAGGCAGGCGGAATTCGTCGTGTAGCGGTGAAATGCT TAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGGCCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCAAACAGGATTAGATACCCGGGTAGTCCGG 83_6 M01032:110:000000000-A5D3N:1:1101:17160:1588 1:N:0:0 orig_bc=ACATTCAGCGCA new_bc=ACATTCAGCGCA bc_diffs=0 TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGATGAGTAAGTCAGCGGTGAAATACCCCAGCTCAACTGGGGGGCTGCCGTTGATACTGCTTATCTAGAGTGCGAACGGCGCCGGCGGAATGTGTCATGTAGCGGTGAAATGCTTAGAGATGACACAGAAACCCGATCGCGAAGGCAGCCGGCGAGCACGACACTGACGCTGAGGCACGAAGGTGCGGGGATCAAACAGGATTAGATACCCGTGAAGTCCGG 64_7 M01032:110:000000000-A5D3N:1:1101:16520:1589 1:N:0:0 orig_bc=GAATCTTCGAGC new_bc=GAATCTTCGAGC bc_diffs=0 TACGTAGGTTGCAAGCGTTGTCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGGCCTTTAAGTCAGTGGTCAAAGCGTGTGGCTCAACCCTACCACGCCGTTGATACTGGAGGCCTTGAGTGCACATAAGGATGGTGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTGTCCGGGGCGTAACTGACGCTAATGCTCGAAAGTGCGGGTATCAAACAGGATTAGATACCCCAGTAGTCCGGC 20_8 M01032:110:000000000-A5D3N:1:1101:15950:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 TACGGAAGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCATGCTAAGTCTGCCGTCAAATGGCAGGGCTCAACCCTGTCTTGCGGTGGAAACTGATGGGCTTGAGTACACTCGAGGCAAGTGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTGCCTGGTGTGCGATTGACGCTGAGGCTCGAAGGTGCGGGAATCAAACAGGATTAGATACCCGAGTAGTCCGGC 40_9 M01032:110:000000000-A5D3N:1:1101:15928:1589 1:N:0:0 orig_bc=AAGAGATGTCGA new_bc=AAGAGATGTCGA bc_diffs=0 TACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGCCTGGCAAGCCAGAGGTGAAAACCCGGGGCTCAACCCCGTGATTGCCTTTGGAACTGTTAGGCTTGAGTACTGGAGGGGCAGGCGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCCTGCTGGACAGAAACTGACGCTGGGGCTCGAAAGCGTGGGGGGCAAACAGGATTAGTTACCCCGGTAGCCGGG 20_10 M01032:110:000000000-A5D3N:1:1101:14491:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCTTTTTAAGTCAGTGGTTAAATCGTGACGCTCAACGTCATCACGCCATTGATACTGGAGAGCTTGATTGCGGTCGAGGTTTGCGGAATTCGTTGTGTAGCGGTGAAATGCATAGATATGACGAAGAACACCGATTGCGTAGGCAGCAGACCAGGCCGTAAATGACGCTCATGCACGAAAGTGCGGGGATCGAACAGGATTAGATACCCGGGTAGTCCGGC

On Jun 16, 2015, at 11:11 AM, Robert Leach notifications@github.com<mailto:notifications@github.commailto:notifications@github.com> wrote:

Hi Koji,

Thanks for using CFF. Let me explain how getReals.pl works, but first, let me back up to the previous step....

getCandidates.pl selects sequences it thinks are real by comparing each sequence's abundance with the abundance you would expect if it were an erroneous sequence (derived from neighboring sequences that differ by only 1 base). It uses the estimated error rate of each nucleotide change to predict how likely it is that its neighbors were misread during PCR to produce it. If the sequence's actual abundance exceeds the predicted erroneous abundance by a given threshold, it is assumed to be real.
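To make that comparison concrete, here is a minimal Python sketch of this kind of test. It is an illustration under stated assumptions, not the code in getCandidates.pl: the function name, the data layout, and the exact formula (expected erroneous abundance as the sum of each neighbor's abundance times an error rate, compared against a fold threshold such as the 10 reported as "Magnitude over N0 Threshold" in the run log) are all hypothetical.

```python
def is_candidate(abundance, neighbors, fold_threshold=10.0):
    """Return True if a sequence is more abundant than errors alone would predict.

    abundance      -- observed count of the sequence
    neighbors      -- list of (neighbor_abundance, error_rate) pairs, one per
                      sequence differing by 1 base (hypothetical layout)
    fold_threshold -- how many times above the expected erroneous abundance the
                      observed count must be (assumed meaning of the threshold)
    """
    # Expected count if every copy of this sequence were an error from a neighbor.
    expected_from_errors = sum(n_abund * rate for n_abund, rate in neighbors)
    if expected_from_errors == 0:
        # No neighbor could have produced it by a single error.
        return abundance > 0
    return abundance / expected_from_errors >= fold_threshold


# A rare variant next to a dominant sequence: 1000 * 0.001 = 1 copy expected by error.
print(is_candidate(5, [(1000, 0.001)]))   # only 5x above expectation
print(is_candidate(50, [(1000, 0.001)]))  # 50x above expectation
```

With the assumed threshold of 10, the first call reports the sequence as not a candidate and the second reports it as a candidate.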

Now getReals.pl takes it a step further. It looks across multiple samples (e.g. a time series) and considers all the candidates from the different sample files. If it sees the same candidate sequence a threshold number of times, it assumes the sequence is real. It also filters for chimeras.

So, running getReals.pl requires multiple sample files (at least 2) in order to run. Essentially, your results with a single sample file are the candidate files.
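As a rough illustration of the cross-sample step (a sketch, not the actual getReals.pl logic), the Python below counts in how many samples each sequence is a candidate and keeps those that reach a minimum number of candidacies. The function name and the dict-of-sets input are hypothetical; the default of 2 mirrors the `-k 2` shown in the run log, and the up-front check mirrors the "Too few candidates files" error.

```python
from collections import Counter


def filter_by_candidacies(candidates_per_sample, min_candidacies=2):
    """Keep sequences that are candidates in at least min_candidacies samples.

    candidates_per_sample -- dict mapping sample name -> set of candidate sequences
    """
    if len(candidates_per_sample) < min_candidacies:
        # With fewer samples than the threshold, no sequence could ever pass.
        raise ValueError(
            "need at least %d samples, got %d"
            % (min_candidacies, len(candidates_per_sample))
        )
    counts = Counter()
    for seqs in candidates_per_sample.values():
        counts.update(set(seqs))  # each sample contributes at most 1 per sequence
    return {seq for seq, n in counts.items() if n >= min_candidacies}


samples = {
    "sample_44": {"ACGT", "TTTT"},
    "sample_54": {"ACGT", "GGGG"},
}
print(filter_by_candidacies(samples))
```

Here only "ACGT" is a candidate in both samples, so it is the only sequence kept. Passing a single sample with the default threshold raises the error, which is the situation described in this thread.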

If you wish to perform chimera filtering on a single sample file, currently, you would have to run uchime on your own. getReals.pl runs uchime in 2 separate calls:

usearch -usearch_global $tmp_lib -db $gcand_file -id 1.0 -idprefix $seqlen -matched $cands_with_global_abund -strand plus -quiet

usearch -uchime_denovo $cands_with_global_abund -minuniquesize 2 -nonchimeras $tmp_out_file $aln_arg -quiet

where:

$tmp_lib = a temporary library file output in the first step.
$gcand_file = your candidates file.
$seqlen = the length of your sequences.
$cands_with_global_abund = a temporary candidate file output in the first step.
$tmp_out_file = your chimera output file.
$aln_arg = your chimera alignment output file.

Good luck, Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544


hepcat72 commented 9 years ago

BTW, a good sanity check to make sure the 1-liner worked correctly is to make sure these produce the same number:

wc -l your_file

cat your_file.[0-9]* | wc -l
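For anyone who prefers to script this check, here is a small Python sketch of the same comparison. It assumes the split files sit next to the original and are named as the original name plus a dot and a number; the function names are made up for illustration.

```python
import glob


def count_lines(path):
    """Count the lines in a file without loading it all into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def split_matches_original(original):
    """Return True if the split files together hold as many lines as the original."""
    # Same pattern as the shell glob your_file.[0-9]*
    parts = glob.glob(glob.escape(original) + ".[0-9]*")
    return count_lines(original) == sum(count_lines(part) for part in parts)
```

Calling `split_matches_original("your_file")` from the directory that holds the files returns False if any lines went missing, including the case where no split files are found at all.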

Rob


On Jun 17, 2015, at 5:24 PM, kojiyasuda notifications@github.com wrote:

Wow, thank you very much. The script worked like magic. The sample number was placed after “fna”, which we probably need to fix before running CFF. One thing I noticed is that this file only contained 1/2 of the data. I am going to look for the other half of the data and will then run CFF.

L6S21_19130.fna.44

Thank you very much again for your prompt help. I will keep you posted if the run goes nicely! Koji

On Jun 17, 2015, at 3:57 PM, Robert Leach notifications@github.com wrote:

This perl 1-liner should split your file for you:

perl -e 'open(IN,$ARGV[0]);while(<IN>){if(/^>(\d+)_/){if(defined($of)){close(OUT)}$of="$ARGV[0].$1";open(OUT,">>$of");select(OUT);}print}close(IN);close(OUT);' your_file

Replace "your_file" with the path/name of your file. Output files will end in ".##" where ## is your sample number.

You must copy & paste this onto the command line as-is. If your email program has interrupted the single line with hard returns, you must repair it.
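If a multi-line script is easier to paste than the 1-liner, the same split can be sketched in Python. This is an illustrative equivalent, not part of CFF: the function name is hypothetical, and it assumes deflines start with ">" followed by the sample number and an underscore (e.g. ">54_0 ..."). Unlike the 1-liner, it overwrites existing output files instead of appending to them, and it drops any lines that come before the first defline.

```python
import re


def split_fasta_by_sample(path):
    """Write each record to '<path>.<sample>' based on a '>NN_' defline prefix."""
    header = re.compile(r"^>(\d+)_")
    handles = {}    # sample number -> open output file
    current = None  # output file for the record being read
    try:
        with open(path) as source:
            for line in source:
                match = header.match(line)
                if match:
                    sample = match.group(1)
                    if sample not in handles:
                        handles[sample] = open("%s.%s" % (path, sample), "w")
                    current = handles[sample]
                if current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

The function returns the sample numbers it found, so the number of output files is easy to check. It keeps one output file open per sample, which is fine for dozens of samples but may hit the open-file limit with thousands.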

Rob


On Jun 17, 2015, at 3:43 PM, kojiyasuda notifications@github.com wrote:

Hi Rob,

Thank you so much for explaining the steps. It looks like we will have to split the file by sample, since the one input file we used contains all our sequences. We will try splitting the file by sample (in fna format) and will try again. Or do you have a flag to tell the script to distinguish between samples (e.g. “54” from “44”, see below) within one file?

Thank you again for your kind help. Koji

54_0 M01032:110:000000000-A5D3N:1:1101:14044:1587 1:N:0:0 orig_bc=GAGGCTCATCAT new_bc=GAGGCTCATCAT bc_diffs=0 TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTAATAAGTCTGAAGTTAAAGGCAGTGGCTTAACCATTGTTCGCTTTGGAAACTGTTAGACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGT AGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCTGTAGTCCGGC 44_1 M01032:110:000000000-A5D3N:1:1101:16826:1587 1:N:0:0 orig_bc=AAGGAGCGCCTT new_bc=AAGGAGCGCCTT bc_diffs=0 TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCTGACAAGTCAGCGGTGAAATGTCCACGCTCAACGTGGAAAGTGCCGTTGAAACTGCCGGACTAGAATTCGGATGCCGTGGGAGGAATGTGTAGTGTAGCGGTGAAATGCT TAGATATTACACAGAACACCGATTGCGAAGGCATCTCACGAATCCGACATTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAGGATTAGATACCCCTGTAGTCCGG 57_2 M01032:110:000000000-A5D3N:1:1101:16915:1587 1:N:0:0 orig_bc=AGTAGAGGGATG new_bc=AGTAGAGGGATG bc_diffs=0 TACGGAAGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCGGTCAAATGTCAGGGCCCAACCTTGGCCTGCCGTTGATACTGGTAGTCTTGAATACACACAAGGAAGATGGAATTCGTCGTGTAGCGGTGAGATGCTT AGATATGACGAAGAACTCCGATTGCGAAGGCAGTCTTCTGGGGTGCGATTGACGCTGAGGCTCGAAAGTGCGGGAATCAAACAGGATTAGAAACCCCAGTAGTCCGGC 14_3 M01032:110:000000000-A5D3N:1:1101:15513:1588 1:N:0:0 orig_bc=CCTCGTTCGACT new_bc=CCTCGTTCGACT bc_diffs=0 TTAGATACCCTAGTAGTCCGGCTGACTGACTCCTCGTTCGACTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACCGGATCCTAACTCCGGAACTGCCGATGATACAGATGTGCTGGAATACAGATGCCGTGGGATCAATTAGTAGTGTATCGGTGAAACACA TAGATATTACTCAGAACACCGATTGCGAAGTCATCTCACGAAGCAGGTATTGACGCTGATGCACGAAAACGTGGGGATCAAACAACACTAGAAACCCCAGAATGCCGG 20_4 M01032:110:000000000-A5D3N:1:1101:15485:1588 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCGCTTTAAGTCAGTGGTCAAATCGTGAGGCTCAACCTCATCCCGCCATTGATACTGGAGCGCTTGATTGCGGTTGAGGTAGGCGGAATTCGTCGTGTAGCGGTGAAATGCAT AGATATGACGAAGAACCCCGATTGCGTAGGCAGCTTACCAGACCGACAATGACGCTCATGCACGAAAGTGCGGGGATCGAAAAGGATTAGAAACCCCAGTAGTCCGGC 50_5 M01032:110:000000000-A5D3N:1:1101:16649:1588 1:N:0:0 
orig_bc=TAGGAACTGGCC new_bc=TAGGAACTGGCC bc_diffs=0 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGAAGGTCAAGTCAGCTGTGGAATGTAGTCGCTCAACGTCTGCACTGCAGTTGAAACTGGCCTCCTTGAGTGCGTAAGAGGCAGGCGGAATTCGTCGTGTAGCGGTGAAATGCT TAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGGCCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCAAACAGGATTAGATACCCGGGTAGTCCGG 83_6 M01032:110:000000000-A5D3N:1:1101:17160:1588 1:N:0:0 orig_bc=ACATTCAGCGCA new_bc=ACATTCAGCGCA bc_diffs=0 TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGATGAGTAAGTCAGCGGTGAAATACCCCAGCTCAACTGGGGGGCTGCCGTTGATACTGCTTATCTAGAGTGCGAACGGCGCCGGCGGAATGTGTCATGTAGCGGTGAAATGCTTAGAGATGACACAGAAACCCGATCGCGAAGGCAGCCGGCGAGCACGACACTGACGCTGAGGCACGAAGGTGCGGGGATCAAACAGGATTAGATACCCGTGAAGTCCGG 64_7 M01032:110:000000000-A5D3N:1:1101:16520:1589 1:N:0:0 orig_bc=GAATCTTCGAGC new_bc=GAATCTTCGAGC bc_diffs=0 TACGTAGGTTGCAAGCGTTGTCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGGCCTTTAAGTCAGTGGTCAAAGCGTGTGGCTCAACCCTACCACGCCGTTGATACTGGAGGCCTTGAGTGCACATAAGGATGGTGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTGTCCGGGGCGTAACTGACGCTAATGCTCGAAAGTGCGGGTATCAAACAGGATTAGATACCCCAGTAGTCCGGC 20_8 M01032:110:000000000-A5D3N:1:1101:15950:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 TACGGAAGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCATGCTAAGTCTGCCGTCAAATGGCAGGGCTCAACCCTGTCTTGCGGTGGAAACTGATGGGCTTGAGTACACTCGAGGCAAGTGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTGCCTGGTGTGCGATTGACGCTGAGGCTCGAAGGTGCGGGAATCAAACAGGATTAGATACCCGAGTAGTCCGGC 40_9 M01032:110:000000000-A5D3N:1:1101:15928:1589 1:N:0:0 orig_bc=AAGAGATGTCGA new_bc=AAGAGATGTCGA bc_diffs=0 TACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGCCTGGCAAGCCAGAGGTGAAAACCCGGGGCTCAACCCCGTGATTGCCTTTGGAACTGTTAGGCTTGAGTACTGGAGGGGCAGGCGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCCTGCTGGACAGAAACTGACGCTGGGGCTCGAAAGCGTGGGGGGCAAACAGGATTAGTTACCCCGGTAGCCGGG 20_10 M01032:110:000000000-A5D3N:1:1101:14491:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCTTTTTAAGTCAGTGGTTAAATCGTGACGCTCAACGTCATCACGCCATTGATACTGGAGAGCTTGATTGCGGTCGAGGTTTGCGGAATTCGTTGTGTAGCGGTGAAATGCATAGATATGACGAAGAACACCGATTGCGTAGGCAGCAGACCAGGCCGTAAATGACGCTCATGCACGAAAGTGCGGGGATCGAACAGGATTAGATACCCGGGTAGTCCGGC


kojiyasuda commented 9 years ago

Thank you again, Rob. I don’t think I am doing the second part right…

Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ wc -l Horse_275trimmed_file.fna
26486312 Horse_275trimmed_file.fna
Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ cat Horse_275trimmed_file.fna.[0-9]* | wc -l
cat: Horse_275trimmed_file.fna.[0-9]*: No such file or directory
0
Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ cat Horse_275trimmed_file.[0-9]* | wc -l
cat: Horse_275trimmed_file.[0-9]*: No such file or directory
0
Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ cat Horse_275trimmed_file.fna.[0-9]* | wc -l
cat: Horse_275trimmed_file.fna.[0-9]*: No such file or directory


kojiyasuda commented 9 years ago

Hi Rob,

Thank you so much for your help yesterday. While trying to find the rest of the horse files, I am now trying to run CFF on a different study set and am having trouble. 1) I placed individual fna files into the “Caporaso_FASTA” folder as below, 2) changed “run_example1.tcsh” from “Caporaso_FASTQ/L6S2?.fastq” to “Caporaso_FASTA/*truncated.fna”, and 3) ran the command “tcsh run_example1.tcsh” and received an error (below) saying that it cannot recognize the input files.

It must be something I changed in the “run_example1.tcsh” file or how my input fna files are formatted (an example of the fna files is also attached here).

Please let me know if you have any thoughts on this. Thank you so much! Koji




Koji-Yasudas-MacBook-Pro:samples kojiyasuda$ tcsh run_example1.tcsh

RUNNING run_CFF_on_FastA.tcsh

Start time: Thu Jun 18 09:39:59 EDT 2015
Trim length: 130
Z-score threshold: 2
Magnitude over N0 Threshold: 10
Nominations threshold: 2
OUTPUT DIRECTORY: Caporaso_FASTA_out

mergeSeqs.pl 'Caporaso_FASTA/*truncated.fna' -f 'global_library.fna' --outdir 'Caporaso_FASTA_out/2_lib' -o .lib -b 130 -p ''
ERROR1: Unable to determine file type. Skipping file [Caporaso_FASTA/10oralEND_demuxed_truncated.fna].
ERROR2: No sequences found in input file [Caporaso_FASTA/10oralEND_demuxed_truncated.fna].


10StoolEND_CAACTCCCGTGA_137
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGTTGTTTAAGTCTGTTGTGAAAGTTT
GCGGCTCAACCCTAATATTGCTGTTGATACTGGATATCTTGAGTGCAGTAGAGGCAGGCGGAATTCGTTGTGTAGCGGTG
AAATGCGTAGATATCAGGAAGAACACCGATTGCGAAGGCAGCGTGCTGGGCTGCA
10StoolEND_CAACTCCCGTGA_312
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGATTTTTAAGTCAGCGGTCAAATCGT
GGGGCTCAACCCCATCCAGCCGTTGAAACTGGGGATCTAGAGTGTGCGAGAGGTATGCGGAATGCGTGGTGTAGCGGTGA
AATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGTGCACAA
10StoolEND_CAACTCCCGTGA_792
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGATTGTTAAGTCAGTTGTGAAAGTTT
GCGGCTCAACCGTAAAATTGCAGTTGAAACTGGCAGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTGGTGTAGCGGTG
AAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGTAGCTCACTGGACTGCA
10StoolEND_CAACTCCCGTGA_961
TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCC
CCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTG
AAATGCGTAGATATATGGAGGAACACCAGTGGCGGAGGCGGCTCTCTGGTCTGTA


On Jun 17, 2015, at 8:27 PM, Koji Yasuda koji_yasuda@hms.harvard.edu<mailto:koji_yasuda@hms.harvard.edu> wrote:

Thank you again, Rob. I don’t think I am doing the second part right…

Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ wc -l Horse_275trimmed_file.fna 26486312 Horse_275trimmed_file.fna Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ cat Horse_275trimmed_file.fna.[0-9]* | wc -l cat: Horse_275trimmedfile.fna.[0-9]: No such file or directory 0 Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ cat Horse_275trimmedfile.[0-9] | wc -l cat: Horse_275trimmedfile.[0-9]: No such file or directory 0 Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ cat Horse_275trimmedfile.fna.[0-9] | wc -l cat: Horse_275trimmed_file.fna.[0-9]*: No such file or directory

On Jun 17, 2015, at 5:34 PM, Robert Leach notifications@github.com<mailto:notifications@github.com> wrote:

BTW, a good sanity check to make sure the 1-liner worked correctly is to make sure these produce the same number:

wc -l your_file

cat your_file.[0-9]* | wc -l

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Jun 17, 2015, at 5:24 PM, kojiyasuda notifications@github.com<mailto:notifications@github.com> wrote:

wow, thank you very much. the script worked like a magic. The sample number was placed after “fna”, which we probably need to fix before running CFF. One thing I noticed is that this file only contained 1/2 of the data. I am going to look for the other half of the data and then will run CFF.

L6S21_19130.fna.44

Thank you very much again for your prompt help. I will keep you posted if the run goes nicely! Koji

On Jun 17, 2015, at 3:57 PM, Robert Leach <notifications@github.com> wrote:

This perl 1-liner should split your file for you:

perl -e 'open(IN,$ARGV[0]);while(<IN>){if(/^>(\d+)_/){if(defined($of)){close(OUT)}$of="$ARGV[0].$1";open(OUT,">>$of");select(OUT);}print}close(IN);close(OUT);' your_file

Replace "your_file" with the path/name of your file. Output files will end in ".##" where ## is your sample number.

You must copy & paste this onto the command line as-is. If your email program has interrupted the single line with hard returns, you must repair it.
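For readers following the thread, the splitting step can also be written out as a commented shell function. This is only a sketch of the same idea, not part of CFF: the function name split_by_sample is made up here, it uses awk instead of perl, and it assumes every defline looks like ">44_1 ..." (sample number, underscore, read index). Unlike the 1-liner, it skips any lines that come before the first such defline instead of printing them to the terminal.

```shell
split_by_sample() {
    # $1: FASTA file whose deflines start with ">", a sample number, and "_"
    awk '
        /^>[0-9]+_/ {
            # Sample number = the digits between ">" and the first "_"
            sample = substr($0, 2, index($0, "_") - 2)
            next_out = FILENAME "." sample
            # Keep only one output file open at a time
            if (out != "" && out != next_out) close(out)
            out = next_out
        }
        # Append each line to the file of the most recent sample defline;
        # lines before the first such defline are skipped
        out != "" { print >> out }
    ' "$1"
}
```

Calling split_by_sample your_file writes your_file.44, your_file.54, and so on next to the input. Because the function appends, just like the 1-liner, old output files should be deleted before running it a second time.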

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Jun 17, 2015, at 3:43 PM, kojiyasuda <notifications@github.com> wrote:

Hi Rob,

Thank you so much for explaining the steps. It looks like we will have to split the file by sample, since the one input file we used contains all our sequences. We will try splitting it into per-sample files (in fna format) and run it again. Or do you have a flag to tell the script to distinguish between samples (e.g. “54” from “44”, see below) within one file?

Thank you again for your kind help. Koji

54_0 M01032:110:000000000-A5D3N:1:1101:14044:1587 1:N:0:0 orig_bc=GAGGCTCATCAT new_bc=GAGGCTCATCAT bc_diffs=0 TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTAATAAGTCTGAAGTTAAAGGCAGTGGCTTAACCATTGTTCGCTTTGGAAACTGTTAGACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGT AGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCTGTAGTCCGGC 44_1 M01032:110:000000000-A5D3N:1:1101:16826:1587 1:N:0:0 orig_bc=AAGGAGCGCCTT new_bc=AAGGAGCGCCTT bc_diffs=0 TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCTGACAAGTCAGCGGTGAAATGTCCACGCTCAACGTGGAAAGTGCCGTTGAAACTGCCGGACTAGAATTCGGATGCCGTGGGAGGAATGTGTAGTGTAGCGGTGAAATGCT TAGATATTACACAGAACACCGATTGCGAAGGCATCTCACGAATCCGACATTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAGGATTAGATACCCCTGTAGTCCGG 57_2 M01032:110:000000000-A5D3N:1:1101:16915:1587 1:N:0:0 orig_bc=AGTAGAGGGATG new_bc=AGTAGAGGGATG bc_diffs=0 TACGGAAGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCGGTCAAATGTCAGGGCCCAACCTTGGCCTGCCGTTGATACTGGTAGTCTTGAATACACACAAGGAAGATGGAATTCGTCGTGTAGCGGTGAGATGCTT AGATATGACGAAGAACTCCGATTGCGAAGGCAGTCTTCTGGGGTGCGATTGACGCTGAGGCTCGAAAGTGCGGGAATCAAACAGGATTAGAAACCCCAGTAGTCCGGC 14_3 M01032:110:000000000-A5D3N:1:1101:15513:1588 1:N:0:0 orig_bc=CCTCGTTCGACT new_bc=CCTCGTTCGACT bc_diffs=0 TTAGATACCCTAGTAGTCCGGCTGACTGACTCCTCGTTCGACTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACCGGATCCTAACTCCGGAACTGCCGATGATACAGATGTGCTGGAATACAGATGCCGTGGGATCAATTAGTAGTGTATCGGTGAAACACA TAGATATTACTCAGAACACCGATTGCGAAGTCATCTCACGAAGCAGGTATTGACGCTGATGCACGAAAACGTGGGGATCAAACAACACTAGAAACCCCAGAATGCCGG 20_4 M01032:110:000000000-A5D3N:1:1101:15485:1588 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCGCTTTAAGTCAGTGGTCAAATCGTGAGGCTCAACCTCATCCCGCCATTGATACTGGAGCGCTTGATTGCGGTTGAGGTAGGCGGAATTCGTCGTGTAGCGGTGAAATGCAT AGATATGACGAAGAACCCCGATTGCGTAGGCAGCTTACCAGACCGACAATGACGCTCATGCACGAAAGTGCGGGGATCGAAAAGGATTAGAAACCCCAGTAGTCCGGC 50_5 M01032:110:000000000-A5D3N:1:1101:16649:1588 1:N:0:0 
orig_bc=TAGGAACTGGCC new_bc=TAGGAACTGGCC bc_diffs=0 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGAAGGTCAAGTCAGCTGTGGAATGTAGTCGCTCAACGTCTGCACTGCAGTTGAAACTGGCCTCCTTGAGTGCGTAAGAGGCAGGCGGAATTCGTCGTGTAGCGGTGAAATGCT TAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGGCCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCAAACAGGATTAGATACCCGGGTAGTCCGG 83_6 M01032:110:000000000-A5D3N:1:1101:17160:1588 1:N:0:0 orig_bc=ACATTCAGCGCA new_bc=ACATTCAGCGCA bc_diffs=0 TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGATGAGTAAGTCAGCGGTGAAATACCCCAGCTCAACTGGGGGGCTGCCGTTGATACTGCTTATCTAGAGTGCGAACGGCGCCGGCGGAATGTGTCATGTAGCGGTGAAATGCTTAGAGATGACACAGAAACCCGATCGCGAAGGCAGCCGGCGAGCACGACACTGACGCTGAGGCACGAAGGTGCGGGGATCAAACAGGATTAGATACCCGTGAAGTCCGG 64_7 M01032:110:000000000-A5D3N:1:1101:16520:1589 1:N:0:0 orig_bc=GAATCTTCGAGC new_bc=GAATCTTCGAGC bc_diffs=0 TACGTAGGTTGCAAGCGTTGTCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGGCCTTTAAGTCAGTGGTCAAAGCGTGTGGCTCAACCCTACCACGCCGTTGATACTGGAGGCCTTGAGTGCACATAAGGATGGTGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTGTCCGGGGCGTAACTGACGCTAATGCTCGAAAGTGCGGGTATCAAACAGGATTAGATACCCCAGTAGTCCGGC 20_8 M01032:110:000000000-A5D3N:1:1101:15950:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 TACGGAAGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCATGCTAAGTCTGCCGTCAAATGGCAGGGCTCAACCCTGTCTTGCGGTGGAAACTGATGGGCTTGAGTACACTCGAGGCAAGTGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTGCCTGGTGTGCGATTGACGCTGAGGCTCGAAGGTGCGGGAATCAAACAGGATTAGATACCCGAGTAGTCCGGC 40_9 M01032:110:000000000-A5D3N:1:1101:15928:1589 1:N:0:0 orig_bc=AAGAGATGTCGA new_bc=AAGAGATGTCGA bc_diffs=0 TACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGCCTGGCAAGCCAGAGGTGAAAACCCGGGGCTCAACCCCGTGATTGCCTTTGGAACTGTTAGGCTTGAGTACTGGAGGGGCAGGCGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCCTGCTGGACAGAAACTGACGCTGGGGCTCGAAAGCGTGGGGGGCAAACAGGATTAGTTACCCCGGTAGCCGGG 20_10 M01032:110:000000000-A5D3N:1:1101:14491:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCTTTTTAAGTCAGTGGTTAAATCGTGACGCTCAACGTCATCACGCCATTGATACTGGAGAGCTTGATTGCGGTCGAGGTTTGCGGAATTCGTTGTGTAGCGGTGAAATGCATAGATATGACGAAGAACACCGATTGCGTAGGCAGCAGACCAGGCCGTAAATGACGCTCATGCACGAAAGTGCGGGGATCGAACAGGATTAGATACCCGGGTAGTCCGGC

On Jun 16, 2015, at 11:11 AM, Robert Leach <notifications@github.com> wrote:

Hi Koji,

Thanks for using CFF. Let me explain how getReals.pl works, but first, let me back up to the previous step....

getCandidates.pl selects sequences it thinks are real by comparing each sequence with the abundance you would expect to get if it was an erroneous sequence (as compared to neighboring sequences that differ only by 1 base). It uses the estimated error rate of each nucleotide change to predict the likelihood that its neighbors were misread during PCR to produce it. If the sequence's actual abundance is above the erroneous predicted abundance by a given threshold, it is assumed to be real.

Now getReals.pl takes it a step further. It looks across multiple samples (e.g. a time series) and pools the candidates from the different sample files. If the same candidate sequence appears in at least a threshold number of samples (the -k option), it is assumed to be real. getReals.pl also filters out chimeras.

So, running getReals.pl requires multiple sample files (at least 2) in order to run. Essentially, your results with a single sample file are the candidate files.
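To make the cross-sample idea concrete, here is a rough sketch of the "seen in at least k samples" rule. It is not how getReals.pl is implemented and it does no chimera filtering. The function name nominate is made up here, and the inputs are assumed to be hypothetical plain-text lists with one candidate sequence per line, one file per sample, which is not the real .cands format.

```shell
nominate() {
    # $1: minimum number of sample files a sequence must appear in (plays the role of -k)
    # remaining arguments: hypothetical one-sequence-per-line candidate lists, one per sample
    k=$1
    shift
    awk -v k="$k" '
        # Count each sequence at most once per file, ignoring empty lines
        NF && !seen[FILENAME, $0]++ { files[$0]++ }
        # Keep the sequences nominated by at least k files
        END { for (s in files) if (files[s] >= k) print s }
    ' "$@" | sort
}
```

With a single input file and k set to 2, no sequence can ever reach the threshold, which is the situation the "Too few candidates files" error guards against.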

If you wish to perform chimera filtering on a single sample file, you would currently have to run uchime on your own. getReals.pl runs uchime in 2 separate usearch calls:

usearch -usearch_global $tmp_lib -db $gcand_file -id 1.0 -idprefix $seqlen -matched $cands_with_global_abund -strand plus -quiet

usearch -uchime_denovo $cands_with_global_abund -minuniquesize 2 -nonchimeras $tmp_out_file $aln_arg -quiet

where:

$tmp_lib = a temporary library file output in the first step. $gcand_file = your candidates file. $seqlen = the length of your sequences. $cands_with_global_abund = a temporary candidate file output in the first step. $tmp_out_file = your chimera output file. $aln_arg = your chimera alignment output file.

Good luck, Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544


— Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHubhttps://github.com/hepcat72/CFF/issues/4#issuecomment-112465264.


hepcat72 commented 9 years ago

Hi Koji,

Regarding the wc commands, this:

Horse_275trimmed_file.fna.[0-9]*

should correspond to the output files from the perl 1-liner. So if you ran:

perl -e 'open(IN,$ARGV[0]);while(<IN>){if(/^>(\d+)_/){if(defined($of)){close(OUT)}$of="$ARGV[0].$1";open(OUT,">>$of");select(OUT);}print}close(IN);close(OUT);' Horse_275trimmed_file.fna

Then this command:

ls Horse_275trimmed_file.fna.[0-9]*

should show you your output files. If that doesn't work, then perhaps you were in a different directory, moved the output files, or renamed them? The ls command above has to show you your files in order for this command to work:

cat Horse_275trimmed_file.fna.[0-9]* | wc -l

Or, I suppose it's possible that you're running a shell that does not expand character classes (e.g. "[0-9]"). If that's the case, see what you get when you run this:

ls Horse_275trimmed_file.fna.*

If you only see the 1-liner output files, then use that file pattern for the wc command above instead. (If you see more than the 1-liner output files, you'll have to just fully list all of them individually on the command line.)
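The line-count comparison above can be wrapped in a small function that also reports when the file pattern matches nothing. This is a sketch for sh or bash (not tcsh), and the name check_split is made up for illustration. It assumes the split files sit next to the original and are named after it with a numeric suffix, as the 1-liner produces.

```shell
check_split() {
    # $1: original FASTA file that was split into "$1.<sample number>" pieces
    orig=$1
    # Expand the pattern; sh and bash leave it unexpanded when nothing matches
    set -- "$orig".[0-9]*
    if [ ! -e "$1" ]; then
        echo "no split files found matching $orig.[0-9]*" >&2
        return 2
    fi
    total=$(wc -l < "$orig" | tr -d ' ')
    parts=$(cat "$@" | wc -l | tr -d ' ')
    echo "original: $total  split files: $parts"
    # Succeed only when both line counts agree
    [ "$total" -eq "$parts" ]
}
```

Running check_split Horse_275trimmed_file.fna from the directory that holds the split files prints both counts and returns a non-zero status if they differ or if no split files are found.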

Regarding your new data set, could you send me the output of this command on your command line:

head -n 50 Caporaso_FASTA/10oralEND_demuxed_truncated.fna | grep -c -E '^>'

If it produces an error, please paste that into the email. If it produces no output, please attach the file. If you'd rather not send the whole file, run this command:

head -n 50 Caporaso_FASTA/10oralEND_demuxed_truncated.fna > tmp.fna

and send me the tmp.fna output file.
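As a quick self-check before sending anything, a file's format can be guessed from its first non-empty line: FASTA records start with ">" and FASTQ records start with "@". The function below is only an illustration of that convention, with a made-up name (guess_seq_format). It is not the detection logic mergeSeqs.pl uses, so a "fasta" answer here does not guarantee that mergeSeqs.pl will agree.

```shell
guess_seq_format() {
    # $1: sequence file; prints fasta, fastq, or unknown
    # Look only at the first line that is not empty or whitespace
    first=$(awk 'NF { print; exit }' "$1")
    case $first in
        '>'*) echo fasta ;;
        '@'*) echo fastq ;;
        *)    echo unknown ;;
    esac
}
```

If guess_seq_format Caporaso_FASTA/10oralEND_demuxed_truncated.fna prints "unknown", the deflines are probably missing their leading ">", which would be one plausible reason for the "Unable to determine file type" error.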

In the meantime, you should be able to get past this problem by editing this file:

/usr/local/bin/run_CFF_on_FastA.tcsh

Where these lines occur:

echo -n "mergeSeqs.pl '$FASTAS' -f '$LIB' --outdir '$ANALDIR/2_lib' -o .lib -b $TRIMLEN -p ''"
mergeSeqs.pl "$FASTAS" -f "$LIB" --outdir "$ANALDIR/1_lib" -o .lib -p '' -b $TRIMLEN --overwrite

Add the option -t fasta to the end of each line (inside the closing quote of the echo line), as reflected in the lines below:

echo -n "mergeSeqs.pl '$FASTAS' -f '$LIB' --outdir '$ANALDIR/2_lib' -o .lib -b $TRIMLEN -p '' -t fasta"
mergeSeqs.pl "$FASTAS" -f "$LIB" --outdir "$ANALDIR/1_lib" -o .lib -p '' -b $TRIMLEN --overwrite -t fasta

Thanks, Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

Begin forwarded message:

From: kojiyasuda notifications@github.com Subject: Re: [CFF] ERROR: Command getReals.pl failed (#4) Date: June 18, 2015 at 9:53:36 AM EDT To: hepcat72/CFF CFF@noreply.github.com Cc: Robert Leach rleach@genomics.princeton.edu Reply-To: hepcat72/CFF reply@reply.github.com

Hi Rob,

Thank you so much for your help yesterday. While trying to find the rest of the horse files, I am now trying to run CFF on a different study set and am having trouble. 1) I placed individual fna files into the “Caporaso_FASTA” folder as below, 2) changed “run_example1.tcsh” from “Caporaso_FASTQ/L6S2?.fastq” to “Caporaso_FASTA/*truncated.fna”, and 3) ran the command line “tcsh run_example1.tcsh” and received an error (below) saying that it cannot recognize the input files.

It must be something I changed in the “run_example1.tcsh” file or how my input fna files are formatted (example of fna files are also attached here).

Please let me know if you have any thoughts on this. Thank you so much! Koji


[cid:9BDD1CEE-8212-4A65-8F0A-CFC2BFF8929E@rc.fas.harvard.edu]


Koji-Yasudas-MacBook-Pro:samples kojiyasuda$ tcsh run_example1.tcsh

RUNNING run_CFF_on_FastA.tcsh

Start time: Thu Jun 18 09:39:59 EDT 2015
Trim length: 130
Z-score threshold: 2
Magnitude over N0 Threshold: 10
Nominations threshold: 2
OUTPUT DIRECTORY: Caporaso_FASTA_out

mergeSeqs.pl 'Caporaso_FASTA/*truncated.fna' -f 'global_library.fna' --outdir 'Caporaso_FASTA_out/2_lib' -o .lib -b 130 -p ''
ERROR1: Unable to determine file type. Skipping file [Caporaso_FASTA/10oralEND_demuxed_truncated.fna].
ERROR2: No sequences found in input file [Caporaso_FASTA/10oralEND_demuxed_truncated.fna].


10StoolEND_CAACTCCCGTGA_137 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGTTGTTTAAGTCTGTTGTGAAAGTTT GCGGCTCAACCCTAATATTGCTGTTGATACTGGATATCTTGAGTGCAGTAGAGGCAGGCGGAATTCGTTGTGTAGCGGTG AAATGCGTAGATATCAGGAAGAACACCGATTGCGAAGGCAGCGTGCTGGGCTGCA 10StoolEND_CAACTCCCGTGA_312 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGATTTTTAAGTCAGCGGTCAAATCGT GGGGCTCAACCCCATCCAGCCGTTGAAACTGGGGATCTAGAGTGTGCGAGAGGTATGCGGAATGCGTGGTGTAGCGGTGA AATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGTGCACAA 10StoolEND_CAACTCCCGTGA_792 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGATTGTTAAGTCAGTTGTGAAAGTTT GCGGCTCAACCGTAAAATTGCAGTTGAAACTGGCAGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTGGTGTAGCGGTG AAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGTAGCTCACTGGACTGCA 10StoolEND_CAACTCCCGTGA_961 TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCC CCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTG AAATGCGTAGATATATGGAGGAACACCAGTGGCGGAGGCGGCTCTCTGGTCTGTA


Begin forwarded message:

From: kojiyasuda notifications@github.com Subject: Re: [CFF] ERROR: Command getReals.pl failed (#4) Date: June 17, 2015 at 8:28:03 PM EDT To: hepcat72/CFF CFF@noreply.github.com Cc: Robert Leach rleach@genomics.princeton.edu Reply-To: hepcat72/CFF reply@reply.github.com

Thank you again, Rob. I don’t think I am doing the second part right…

Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ wc -l Horse_275trimmed_file.fna
26486312 Horse_275trimmed_file.fna
Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ cat Horse_275trimmed_file.fna.[0-9]* | wc -l
cat: Horse_275trimmedfile.fna.[0-9]: No such file or directory
0
Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ cat Horse_275trimmedfile.[0-9] | wc -l
cat: Horse_275trimmedfile.[0-9]: No such file or directory
0
Koji-Yasudas-MacBook-Pro:Horse_fna_file kojiyasuda$ cat Horse_275trimmedfile.fna.[0-9] | wc -l
cat: Horse_275trimmed_file.fna.[0-9]*: No such file or directory

On Jun 17, 2015, at 5:34 PM, Robert Leach <notifications@github.com> wrote:

BTW, a good sanity check to make sure the 1-liner worked correctly is to make sure these produce the same number:

wc -l your_file

cat your_file.[0-9]* | wc -l

Rob

Robert William Leach 133A Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544

On Jun 17, 2015, at 5:24 PM, kojiyasuda <notifications@github.com> wrote:

wow, thank you very much. the script worked like a magic. The sample number was placed after “fna”, which we probably need to fix before running CFF. One thing I noticed is that this file only contained 1/2 of the data. I am going to look for the other half of the data and then will run CFF.

L6S21_19130.fna.44

Thank you very much again for your prompt help. I will keep you posted if the run goes nicely! Koji

On Jun 17, 2015, at 3:57 PM, Robert Leach notifications@github.com<mailto:notifications@github.commailto:notifications@github.com> wrote:

This perl 1-liner should split your file for you:

perl -e 'open(IN,$ARGV[0]);while(<IN>){if(/^>(\d+)_/){if(defined($of)){close(OUT)}$of="$ARGV[0].$1";open(OUT,">>$of");select(OUT);}print}close(IN);close(OUT);' your_file

Replace "your_file" with the path/name of your file. Output files will end in ".##" where ## is your sample number.

You must copy & paste this onto the command line as-is. If your email program has interrupted the single line with hard returns, you must repair it.
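For readers who would rather not paste a long one-liner, the same idea can be sketched with awk inside a shell function. This is not the author's script: `split_by_sample` is a hypothetical name, and the sketch assumes every header line looks like `>54_0 ...`, with the sample number between `>` and the first underscore.

```shell
# Hypothetical alternative to the perl 1-liner (not part of CFF).
# Writes each record to "<input>.<sample number>".
split_by_sample() {
    awk -v base="$1" '
        /^>[0-9]+_/ {
            s = substr($0, 2)          # drop the leading ">"
            sub(/_.*/, "", s)          # keep only the sample number
            next_out = base "." s
            if (next_out != out) {
                if (out != "") close(out)   # avoid holding many files open
                out = next_out
            }
        }
        out != "" { print >> out }     # append, like ">>" in the 1-liner
    ' "$1"
}

# Usage: split_by_sample your_file
```

Because the output is appended, running it twice on the same file doubles the pieces, so delete old pieces before re-running.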

Rob

On Jun 17, 2015, at 3:43 PM, kojiyasuda notifications@github.com wrote:

Hi Rob,

Thank you so much for explaining the steps. It looks like we will have to split the file by sample, since the one input file we used contains all our sequences. We will try splitting it by sample (in fna format) and run it again. Or do you have a flag that tells the script to separate samples (i.e. “54” from “44”, see below) within one file?

Thank you again for your kind help. Koji

54_0 M01032:110:000000000-A5D3N:1:1101:14044:1587 1:N:0:0 orig_bc=GAGGCTCATCAT new_bc=GAGGCTCATCAT bc_diffs=0 TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTAATAAGTCTGAAGTTAAAGGCAGTGGCTTAACCATTGTTCGCTTTGGAAACTGTTAGACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGT AGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCTGTAGTCCGGC 44_1 M01032:110:000000000-A5D3N:1:1101:16826:1587 1:N:0:0 orig_bc=AAGGAGCGCCTT new_bc=AAGGAGCGCCTT bc_diffs=0 TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCTGACAAGTCAGCGGTGAAATGTCCACGCTCAACGTGGAAAGTGCCGTTGAAACTGCCGGACTAGAATTCGGATGCCGTGGGAGGAATGTGTAGTGTAGCGGTGAAATGCT TAGATATTACACAGAACACCGATTGCGAAGGCATCTCACGAATCCGACATTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAGGATTAGATACCCCTGTAGTCCGG 57_2 M01032:110:000000000-A5D3N:1:1101:16915:1587 1:N:0:0 orig_bc=AGTAGAGGGATG new_bc=AGTAGAGGGATG bc_diffs=0 TACGGAAGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCGGTCAAATGTCAGGGCCCAACCTTGGCCTGCCGTTGATACTGGTAGTCTTGAATACACACAAGGAAGATGGAATTCGTCGTGTAGCGGTGAGATGCTT AGATATGACGAAGAACTCCGATTGCGAAGGCAGTCTTCTGGGGTGCGATTGACGCTGAGGCTCGAAAGTGCGGGAATCAAACAGGATTAGAAACCCCAGTAGTCCGGC 14_3 M01032:110:000000000-A5D3N:1:1101:15513:1588 1:N:0:0 orig_bc=CCTCGTTCGACT new_bc=CCTCGTTCGACT bc_diffs=0 TTAGATACCCTAGTAGTCCGGCTGACTGACTCCTCGTTCGACTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACCGGATCCTAACTCCGGAACTGCCGATGATACAGATGTGCTGGAATACAGATGCCGTGGGATCAATTAGTAGTGTATCGGTGAAACACA TAGATATTACTCAGAACACCGATTGCGAAGTCATCTCACGAAGCAGGTATTGACGCTGATGCACGAAAACGTGGGGATCAAACAACACTAGAAACCCCAGAATGCCGG 20_4 M01032:110:000000000-A5D3N:1:1101:15485:1588 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCGCTTTAAGTCAGTGGTCAAATCGTGAGGCTCAACCTCATCCCGCCATTGATACTGGAGCGCTTGATTGCGGTTGAGGTAGGCGGAATTCGTCGTGTAGCGGTGAAATGCAT AGATATGACGAAGAACCCCGATTGCGTAGGCAGCTTACCAGACCGACAATGACGCTCATGCACGAAAGTGCGGGGATCGAAAAGGATTAGAAACCCCAGTAGTCCGGC 50_5 M01032:110:000000000-A5D3N:1:1101:16649:1588 1:N:0:0 
orig_bc=TAGGAACTGGCC new_bc=TAGGAACTGGCC bc_diffs=0 TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGAAGGTCAAGTCAGCTGTGGAATGTAGTCGCTCAACGTCTGCACTGCAGTTGAAACTGGCCTCCTTGAGTGCGTAAGAGGCAGGCGGAATTCGTCGTGTAGCGGTGAAATGCT TAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGGCCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCAAACAGGATTAGATACCCGGGTAGTCCGG 83_6 M01032:110:000000000-A5D3N:1:1101:17160:1588 1:N:0:0 orig_bc=ACATTCAGCGCA new_bc=ACATTCAGCGCA bc_diffs=0 TACGGAGGATGCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGATGAGTAAGTCAGCGGTGAAATACCCCAGCTCAACTGGGGGGCTGCCGTTGATACTGCTTATCTAGAGTGCGAACGGCGCCGGCGGAATGTGTCATGTAGCGGTGAAATGCTTAGAGATGACACAGAAACCCGATCGCGAAGGCAGCCGGCGAGCACGACACTGACGCTGAGGCACGAAGGTGCGGGGATCAAACAGGATTAGATACCCGTGAAGTCCGG 64_7 M01032:110:000000000-A5D3N:1:1101:16520:1589 1:N:0:0 orig_bc=GAATCTTCGAGC new_bc=GAATCTTCGAGC bc_diffs=0 TACGTAGGTTGCAAGCGTTGTCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGGCCTTTAAGTCAGTGGTCAAAGCGTGTGGCTCAACCCTACCACGCCGTTGATACTGGAGGCCTTGAGTGCACATAAGGATGGTGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGAAGAACTCCGATTGCGAAGGCAGCTGTCCGGGGCGTAACTGACGCTAATGCTCGAAAGTGCGGGTATCAAACAGGATTAGATACCCCAGTAGTCCGGC 20_8 M01032:110:000000000-A5D3N:1:1101:15950:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 TACGGAAGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCATGCTAAGTCTGCCGTCAAATGGCAGGGCTCAACCCTGTCTTGCGGTGGAAACTGATGGGCTTGAGTACACTCGAGGCAAGTGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATGACGAAGAACTCCGATTGCGAAGGCAGCTGCCTGGTGTGCGATTGACGCTGAGGCTCGAAGGTGCGGGAATCAAACAGGATTAGATACCCGAGTAGTCCGGC 40_9 M01032:110:000000000-A5D3N:1:1101:15928:1589 1:N:0:0 orig_bc=AAGAGATGTCGA new_bc=AAGAGATGTCGA bc_diffs=0 TACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGCCTGGCAAGCCAGAGGTGAAAACCCGGGGCTCAACCCCGTGATTGCCTTTGGAACTGTTAGGCTTGAGTACTGGAGGGGCAGGCGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCCTGCTGGACAGAAACTGACGCTGGGGCTCGAAAGCGTGGGGGGCAAACAGGATTAGTTACCCCGGTAGCCGGG 20_10 M01032:110:000000000-A5D3N:1:1101:14491:1589 1:N:0:0 orig_bc=CAGGCGTATTGG new_bc=CAGGCGTATTGG bc_diffs=0 
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGCTTTTTAAGTCAGTGGTTAAATCGTGACGCTCAACGTCATCACGCCATTGATACTGGAGAGCTTGATTGCGGTCGAGGTTTGCGGAATTCGTTGTGTAGCGGTGAAATGCATAGATATGACGAAGAACACCGATTGCGTAGGCAGCAGACCAGGCCGTAAATGACGCTCATGCACGAAAGTGCGGGGATCGAACAGGATTAGATACCCGGGTAGTCCGGC

On Jun 16, 2015, at 11:11 AM, Robert Leach notifications@github.com wrote:

Hi Koji,

Thanks for using CFF. Let me explain how getReals.pl works, but first, let me back up to the previous step....

getCandidates.pl selects sequences it thinks are real by comparing each sequence with the abundance you would expect to get if it was an erroneous sequence (as compared to neighboring sequences that differ only by 1 base). It uses the estimated error rate of each nucleotide change to predict the likelihood that its neighbors were misread during PCR to produce it. If the sequence's actual abundance is above the erroneous predicted abundance by a given threshold, it is assumed to be real.

Now getReals.pl takes it a step further. It looks across multiple samples (e.g. a time series), considers the candidates from all of the sample files, and assumes a sequence is real if it is a candidate in at least a threshold number of samples (the -k option, which was 2 in your run). It also filters out chimeras.

So getReals.pl requires multiple sample files (at least 2) to run. Essentially, your results with a single sample file are the candidate files.
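To make the "seen in at least k samples" idea concrete, here is a rough shell sketch. It is not how getReals.pl is implemented and it does no chimera filtering. The function name is made up, and it assumes (this is an assumption, not something stated in the thread) that each candidates file is FASTA-like with one sequence per line after each `>` header.

```shell
# Illustration only (not getReals.pl): print sequences that occur in at
# least K of the given files. Assumes single-line sequences.
candidates_in_at_least() {
    k=$1
    shift
    for f in "$@"; do
        # One line per distinct sequence in this sample.
        awk '!/^>/ && NF { print $1 }' "$f" | sort -u
    done | sort | uniq -c | awk -v k="$k" '$1 >= k { print $2 }'
}

# Usage: candidates_in_at_least 2 sampleA.cands sampleB.cands sampleC.cands
```

With a single input file and k set to 2 nothing can pass, which mirrors why the script refuses to run on one sample with -k 2.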

If you wish to perform chimera filtering on a single sample file, currently, you would have to run uchime on your own. getReals.pl runs uchime in 2 separate calls:

usearch -usearch_global $tmp_lib -db $gcand_file -id 1.0 -idprefix $seqlen -matched $cands_with_global_abund -strand plus -quiet

usearch -uchime_denovo $cands_with_global_abund -minuniquesize 2 -nonchimeras $tmp_out_file $aln_arg -quiet

where:

$tmp_lib = a temporary library file output in the first step.
$gcand_file = your candidates file.
$seqlen = the length of your sequences.
$cands_with_global_abund = a temporary candidate file output in the first step.
$tmp_out_file = your chimera output file.
$aln_arg = your chimera alignment output file.

Good luck, Rob

kojiyasuda commented 9 years ago

Hi Rob,

For the second part of the questions, here are some outputs:

Koji-Yasudas-MacBook-Pro:Caporaso_FASTA kojiyasuda$ head -n 50 10StoolEND_demuxed_truncated.fna | grep -c -E '^>'
13

hepcat72 commented 9 years ago

Alright, well if that works, then I'm guessing that you must not be running the example script from the samples directory? The only way that I can imagine that you can get that error, since the command I sent you works, is that the file it is running on is empty and that the file you ran the command I sent you on is named the same, but is not the file the script is looking at.

I would recommend running the analysis script directly instead of using the examples. Anyhow, that is how it is intended to be used. So instead of changing the example script and the original example files, create a directory to hold your input files (just to keep things neat & tidy), move your input files in there, and run everything in there. Like this:

mkdir myanalysis
mv *_truncated.fna myanalysis/
cd myanalysis
run_CFF_on_FastA.tcsh 130 outputdir "*_truncated.fna"

Also note, choosing an appropriate trim length is important. 130 is the trim length used in the example data. Your data might be better at a different trim length.
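One quick way to get a feel for a sensible trim length is to tabulate the read lengths in a file first. The sketch below is not part of CFF and does not pick a trim length for you. `seq_length_counts` is a hypothetical name, and it assumes each sequence sits on a single line.

```shell
# Hypothetical helper (not part of CFF): print "length count" pairs for
# the sequences in a FASTA file, shortest first.
seq_length_counts() {
    awk '!/^>/ && NF { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' "$1" | sort -n
}

# Usage: seq_length_counts 10StoolEND_demuxed_truncated.fna
```

If most reads are shorter than the trim length you pass to run_CFF_on_FastA.tcsh, that is a hint the value deserves a second look.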

Rob

kojiyasuda commented 9 years ago

Hi Rob,

Thank you so much for being patient and walking through the steps. It worked, and the data looks great! Once we can locate the other half of our data, I will run CFF again on that data. I shouldn’t have any issues now, but if I get stuck, I might have to email you!

thank you so much again for walking me through this. Koji

kojiyasuda commented 9 years ago

Hi Rob,

CFF has been running great so far for me, have been happy with the results, but now, I am running a large dataset (500 samples) and want to make sure if it is running correctly or I should terminate and start from where it had a trouble. It’s been saying that it is running “muscle” and “filterIndels.pl” back and forth for the past 36 hours or so. The current output folder looks like this, where nothing has been created in folder called “5_indels” yet, which makes me think that it is still doing something… Since I am running this using “nohup”, the err file looks like this (see attached).

Please let me know if you know anything as to whether the script is still running correctly, and if it is not, I am thinking I can start from the files created in 3_cands, and if so, how can I finish the process?

Thank you so much for your time, and I really appreciate your input on this! thank you again, Koji

[kyasuda@hutlab3 outputdir_adipose_CFF]$ ls
1_lib 2_n0s 3_cands 4_reals_table 5_indels

hepcat72 commented 9 years ago

Hi Koji,

Looks like I didn't get the attachment for whatever reason, but I can offer tips in the interim. The filterIndels step is the time-limiting factor in a CFF run because it is essentially doing a multiple sequence alignment to find indels.

filterIndels.pl offers a number of custom options for speeding up a slow run.

If your data has a tendency for homopolymer errors (e.g. as does 454, ion torrent, etc), the fastest speedup you can possibly get is to run in --homopolymer-mode. This would assume that any non-homopolymer indel is real (i.e. that the polymerase is not erroneously introducing indels). --homopolymer-mode is extremely fast compared to the default usage of muscle for sequence alignment. It is very accurate, but does not catch indels that involve non-repeated bases. This flag is an option that must be provided in the call of filterIndels.pl.

If, however, you are dealing with a sequencing technology that is not prone to homopolymer errors (e.g. Illumina), then there are a number of options provided by the script. First, filterIndels.pl can take advantage of multiple cores. Simply running on a machine with more cores will speed up the computation linearly with each added core. E.g. going from 2 cores to 4 cores will roughly halve the running time. [Note though that sometimes the perl module which detects the number of cores (and the system memory) can be inaccurate. To ensure you're using all your system's resources, you can set the number of cores manually with the --parallel-processes option. Similarly, you can set the amount of RAM with the --gigs-ram option.]
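If you want to check the core count yourself before setting it by hand, something like the following can help. It is a sketch with made-up function names, and it assumes that --parallel-processes takes the number as its next argument, which should be confirmed against the usage output of filterIndels.pl.

```shell
# Hypothetical helpers (not part of CFF).
# Report the number of online cores, falling back to 1 if unknown.
detect_cores() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=
    if [ -z "$n" ]; then
        n=$(nproc 2>/dev/null) || n=
    fi
    case $n in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric answer
    esac
    echo "$n"
}

# Print the option text to add to the filterIndels.pl call.
# Assumption: the flag is followed by the core count.
parallel_option() {
    printf '%s %s\n' "--parallel-processes" "$(detect_cores)"
}

# Usage: parallel_option
```

On a shared server you may want a smaller number than the machine reports, so treat the printed value as an upper bound.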

If you do not have access to a machine with more cores, there are a few more options you can try, but you will sacrifice a bit of accuracy for speed:

--align-mode global Aligns more sequences at a time, which is significantly faster, but less accurate due to complexity. The default mode will only align groups of highly similar sequences (using -v).

-v N (aka --heuristic-str-size N) Increase N from its default of 11. This hashing heuristic allows you to skip alignments that do not share an N-base string in a different position. There are a few options that allow you to tweak the way -v works. To see a list of advanced options, run:

filterIndels.pl --extended

To do any of these things, you will have to run your analysis manually, step by step, instead of using the pipelined shell script. Or you can edit the shell script itself to add these options. (Note if you update your copy of CFF, your changes would be lost, so I'd recommend copying the shell script and editing the copy.)

To rerun from where you left off at the last fully completed candidates step, you can either run each step manually, replace the --overwrite flag with the --skip-existing flag to every script call except the one that is partially done (i.e. filterIndels.pl), or comment out the steps that were completed (using the # character).
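The flag swap can be done on a copy of the pipeline script with sed rather than by hand. This is only a sketch: `make_resume_copy` and the file names in the usage line are hypothetical, and it assumes each script call sits on a single line that literally contains `--overwrite`. Review the copy before running it.

```shell
# Hypothetical helper (not part of CFF): write a copy of the pipeline
# script in which --overwrite becomes --skip-existing on every line
# except those that call filterIndels.pl.
make_resume_copy() {
    src=$1
    dst=$2
    sed '/filterIndels\.pl/!s/--overwrite/--skip-existing/g' "$src" > "$dst"
}

# Usage: make_resume_copy run_CFF_on_FastA.tcsh my_CFF_resume.tcsh
```

Working on a copy also keeps the change from being lost when CFF is updated.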

Good luck, Rob

kojiyasuda commented 9 years ago

HI Rob,

Thank you so much for explaining all of this to me. That makes sense why filterIndels is taking a long time to complete. Since it has been running for 36 hours already, I am going to wait until this evening. If it is still running then, as you’ve suggested, I am going to move these files onto one of the servers that will allow me to use multiple cores (n=6).

These are illumina reads and thank you so much for listing all the options we can do with CFF. I did not know this, would love to try some of them, if needed. I will keep you posted regardless if this run goes well or not.

Thank you again so much for being really helpful. koji

kojiyasuda commented 8 years ago

Hi Rob,

Thank you once again for your help with my last dataset! I am trying to run another dataset through CFF and have been getting this error. I had one consolidated sequencing file, so I wrote this script to separate sequences per sample (see attached for the script if helpful). I am also pasting several sequences into this email below.

Does anything stand out to you that I am doing wrong? Thank you so much again for your help, Koji


[kyasuda@hutlab3 ben_CFF]$ ls
100.fna 101.fna 102.fna 103.fna 104.fna 105.fna 106.fna 107.fna 108.fna 88.fna 89.fna 91.fna 92.fna 94.fna 95.fna 96.fna 99.fna
[kyasuda@hutlab3 ben_CFF]$ run_CFF_on_FastA.tcsh 253 outputdir “*.fna” &
[1] 27134
[kyasuda@hutlab3 ben_CFF]$

RUNNING run_CFF_on_FastA.tcsh

Start time: Mon Oct 19 15:24:36 EDT 2015
Trim length: 253
Z-score threshold: 2
Magnitude over N0 Threshold: 10
Nominations threshold: 2
OUTPUT DIRECTORY: outputdir

mergeSeqs.pl '“*.fna”' -f 'global_library.fna' --outdir 'outputdir/2_lib' -o .lib -b 253 -p ''
ERROR1: Unable to open input file: [“*.fna”]. No such file or directory

Done. EXIT STATUS: [ERRORS: 1 WARNINGS: 0 TIME: 0s] Scroll up to inspect full errors/warnings in-place. Supply --verbose for extended run report.
-- 0 seconds
neighbors.pl 'outputdir/1_lib/global_library.fna' -o .nbrs
ERROR1: Unable to open input file: [outputdir/1_lib/global_library.fna]. No such file or directory

Done. EXIT STATUS: [ERRORS: 1 WARNINGS: 0 TIME: 1s] Scroll up to inspect full errors/warnings in-place.
-- 1 seconds
errorRates.pl 'outputdir/1_lib/global_library.fna' -n 'outputdir/1_lib/global_library.fna.nbrs' -z 2 -o .erates
ERROR1: Unable to open input file: [outputdir/1_lib/global_library.fna]. No such file or directory

Done. EXIT STATUS: [ERRORS: 1 WARNINGS: 0 TIME: 0s] Scroll up to inspect full errors/warnings in-place.
-- 0 seconds
nZeros.pl 'outputdir/1_lib/{“*.fna”}.lib' -n 'outputdir/1_lib/global_library.fna.nbrs' -r 'outputdir/1_lib/global_library.fna.erates' -o .n0s --outdir 'outputdir/2_n0s'
ERROR1: Unable to open input file: [outputdir/1_lib/{“*.fna”}.lib]. No such file or directory
ERROR2: Unable to parse file [outputdir/1_lib/{“*.fna”}.lib]. Skipping.

Done. EXIT STATUS: [ERRORS: 2 WARNINGS: 0 TIME: 0s] Scroll up to inspect full errors/warnings in-place.
-- 0 seconds
getCandidates.pl 'outputdir/2_n0s/{“*.fna”}.lib.n0s' -o .cands -h 10 --outdir 'outputdir/3_cands'
ERROR1: Unable to open input file: [outputdir/2_n0s/{“*.fna”}.lib.n0s]. No such file or directory
WARNING1: No candidates found in any of the input files.

Done. EXIT STATUS: [ERRORS: 1 WARNINGS: 1 TIME: 0s] Scroll up to inspect full errors/warnings in-place.
-- 0 seconds
getReals.pl -i 'outputdir/3_cands/{“*.fna”}.lib.n0s.cands' -n 'outputdir/2_n0s/{“*.fna”}.lib.n0s' -f 'outputdir/1_lib/global_library.fna' -k 2 --outdir 'outputdir/4_reals_table'
ERROR1: Too few candidates files (-i) supplied. The number of minimum candidacies (-k): [2] requires at least as many files supplied to each of the -i and -n (backwards-compatible with -d) options. I.e. -i requires at least [2] files and -n requires at least [2] files. If you only have 1 of each file, then this script should not be applied unless you set -k to 1, which will allow you to filter for chimeras at least.

ERROR: Command getReals.pl failed
Stop time: Mon Oct 19 15:24:37 EDT 2015
RUN TIME: 1 seconds


100_141486 100_0 M00620:36:000000000-A41B2:1:1101:16947:2270 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 GACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTCAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCATTGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCAAATGGG 100_141487 100_1 M00620:36:000000000-A41B2:1:1101:18703:2285 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141488 100_2 M00620:36:000000000-A41B2:1:1101:17596:2301 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGAAGGGTGCAAGCGTTACTCGGAATTACTGGGCGTAAAGCGTGCGTAGGTGGTTGTTTAAGTCTGTTGTGAAAGCCCTGGGCTCAACCTGGGAACTGCAGTGGAAACTGGACGACTAGAGTGTGGTAGAGGGTAGCGGAATTCCTGGTGTAGCAGTGAAATGCGTAGAGACCAGGAGGAACATCCATGGCGAAGGCAGCTACCTGGACCAACACTGACACTGAGGCACGAAAGCGTGGGGAGCAAACAGG 100_141489 100_3 M00620:36:000000000-A41B2:1:1101:12928:2341 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141490 100_4 M00620:36:000000000-A41B2:1:1101:18558:2375 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAATACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141491 100_5 M00620:36:000000000-A41B2:1:1101:18548:2393 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA 
bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141492 100_6 M00620:36:000000000-A41B2:1:1101:14243:2462 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGAAGGGGGCTAGCGTTGTTCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGACTTTTAAGTCAGGGGTGAAATCCCAGAGCTCAACTCTGGAACTGCCTTTGATACTGGAAGTCTTGAGTATGGTAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGACCATTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141493 100_7 M00620:36:000000000-A41B2:1:1101:16768:2492 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAGTTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG

kojiyasuda commented 8 years ago

Hi Rob,

Thank you so much for all of your help previously.

I keep getting this error: “ERROR1: Unable to open input file: [“*.fna”]”. Do you have any idea why this error keeps coming up? I’d love to run CFF on several datasets. Is this an error in how we’ve installed CFF on our server? In that case, I am cc’ing Randall and Lauren.

Thank you so much! Koji

On Oct 19, 2015, at 3:30 PM, Koji Yasuda koji_yasuda@hms.harvard.edu<mailto:koji_yasuda@hms.harvard.edu> wrote:

Hi Rob,

Thank you once again for your help during my last dataset! I am trying to run another dataset through CFF and have been getting this error. I had one consolidated sequencing file, so wrote this script to separate sequences per sample (see attached for the script if helpful). I am also pasting a several sequences to this email below.

Anything stand out to you that I am something wrong? Thank you so much again for your help, Koji


[kyasuda@hutlab3 ben_CFF]$ ls 100.fna 101.fna 102.fna 103.fna 104.fna 105.fna 106.fna 107.fna 108.fna 88.fna 89.fna 91.fna 92.fna 94.fna 95.fna 96.fna 99.fna [kyasuda@hutlab3 ben_CFF]$ run_CFF_on_FastA.tcsh 253 outputdir “*.fna” & [1] 27134 [kyasuda@hutlab3 ben_CFF]$

RUNNING run_CFF_on_FastA.tcsh

Start time: Mon Oct 19 15:24:36 EDT 2015
Trim length: 253
Z-score threshold: 2
Magnitude over N0 Threshold: 10
Nominations threshold: 2
OUTPUT DIRECTORY: outputdir

mergeSeqs.pl '“*.fna”' -f 'global_library.fna' --outdir 'outputdir/2_lib' -o .lib -b 253 -p ''
ERROR1: Unable to open input file: [“*.fna”]. No such file or directory

Done. EXIT STATUS: [ERRORS: 1 WARNINGS: 0 TIME: 0s] Scroll up to inspect full errors/warnings in-place. Supply --verbose for extended run report.
-- 0 seconds
neighbors.pl 'outputdir/1_lib/global_library.fna' -o .nbrs
ERROR1: Unable to open input file: [outputdir/1_lib/global_library.fna]. No such file or directory

Done. EXIT STATUS: [ERRORS: 1 WARNINGS: 0 TIME: 1s] Scroll up to inspect full errors/warnings in-place.
-- 1 seconds
errorRates.pl 'outputdir/1_lib/global_library.fna' -n 'outputdir/1_lib/global_library.fna.nbrs' -z 2 -o .erates
ERROR1: Unable to open input file: [outputdir/1_lib/global_library.fna]. No such file or directory

Done. EXIT STATUS: [ERRORS: 1 WARNINGS: 0 TIME: 0s] Scroll up to inspect full errors/warnings in-place.
-- 0 seconds
nZeros.pl 'outputdir/1_lib/{“*.fna”}.lib' -n 'outputdir/1_lib/global_library.fna.nbrs' -r 'outputdir/1_lib/global_library.fna.erates' -o .n0s --outdir 'outputdir/2_n0s'
ERROR1: Unable to open input file: [outputdir/1_lib/{“*.fna”}.lib]. No such file or directory
ERROR2: Unable to parse file [outputdir/1_lib/{“*.fna”}.lib]. Skipping.

Done. EXIT STATUS: [ERRORS: 2 WARNINGS: 0 TIME: 0s] Scroll up to inspect full errors/warnings in-place.
-- 0 seconds
getCandidates.pl 'outputdir/2_n0s/{“*.fna”}.lib.n0s' -o .cands -h 10 --outdir 'outputdir/3_cands'
ERROR1: Unable to open input file: [outputdir/2_n0s/{“*.fna”}.lib.n0s]. No such file or directory
WARNING1: No candidates found in any of the input files.

Done. EXIT STATUS: [ERRORS: 1 WARNINGS: 1 TIME: 0s] Scroll up to inspect full errors/warnings in-place.
-- 0 seconds
getReals.pl -i 'outputdir/3_cands/{“*.fna”}.lib.n0s.cands' -n 'outputdir/2_n0s/{“*.fna”}.lib.n0s' -f 'outputdir/1_lib/global_library.fna' -k 2 --outdir 'outputdir/4_reals_table'
ERROR1: Too few candidates files (-i) supplied. The number of minimum candidacies (-k): [2] requires at least as many files supplied to each of the -i and -n (backwards-compatible with -d) options. I.e. -i requires at least [2] files and -n requires at least [2] files. If you only have 1 of each file, then this script should not be applied unless you set -k to 1, which will allow you to filter for chimeras at least.

ERROR: Command getReals.pl failed
Stop time: Mon Oct 19 15:24:37 EDT 2015
RUN TIME: 1 seconds
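
The run above was started with the file pattern wrapped in typographic quotes (“*.fna”) rather than straight quotes. A shell treats typographic quotes as ordinary characters, so they stay attached to the pattern, and a pattern that begins and ends with those characters cannot match names such as 100.fna. The sketch below illustrates that behaviour only; it assumes a POSIX-like shell, uses made-up file names in a throwaway directory, and the helper count_existing is hypothetical (it is not part of CFF).

```shell
# Throwaway directory with made-up sample files (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 100.fna 101.fna 102.fna

# Hypothetical helper: count how many of the given arguments name existing files.
count_existing() {
  n=0
  for f in "$@"; do
    if [ -e "$f" ]; then n=$((n + 1)); fi
  done
  echo "$n"
}

# Typographic quotes are ordinary characters, so they become part of the pattern.
curly_pattern=“*.fna”
# Straight quotes are removed by the shell, leaving just the glob.
straight_pattern="*.fna"

# Leaving the variables unquoted lets the shell expand each pattern,
# standing in for whatever expansion the receiving script performs.
curly_matches=$(count_existing $curly_pattern)
straight_matches=$(count_existing $straight_pattern)

echo "curly:    pattern=[$curly_pattern] matches=$curly_matches"
echo "straight: pattern=[$straight_pattern] matches=$straight_matches"
```

With the three sample files, the curly-quoted pattern should match nothing while the straight-quoted one matches all three. Retyping the quotes in the terminal instead of pasting them from an email or word processor is one way to rule this out.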


100_141486 100_0 M00620:36:000000000-A41B2:1:1101:16947:2270 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 GACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTCAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCATTGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCAAATGGG 100_141487 100_1 M00620:36:000000000-A41B2:1:1101:18703:2285 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141488 100_2 M00620:36:000000000-A41B2:1:1101:17596:2301 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGAAGGGTGCAAGCGTTACTCGGAATTACTGGGCGTAAAGCGTGCGTAGGTGGTTGTTTAAGTCTGTTGTGAAAGCCCTGGGCTCAACCTGGGAACTGCAGTGGAAACTGGACGACTAGAGTGTGGTAGAGGGTAGCGGAATTCCTGGTGTAGCAGTGAAATGCGTAGAGACCAGGAGGAACATCCATGGCGAAGGCAGCTACCTGGACCAACACTGACACTGAGGCACGAAAGCGTGGGGAGCAAACAGG 100_141489 100_3 M00620:36:000000000-A41B2:1:1101:12928:2341 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141490 100_4 M00620:36:000000000-A41B2:1:1101:18558:2375 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAATACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141491 100_5 M00620:36:000000000-A41B2:1:1101:18548:2393 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA 
bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141492 100_6 M00620:36:000000000-A41B2:1:1101:14243:2462 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGAAGGGGGCTAGCGTTGTTCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGACTTTTAAGTCAGGGGTGAAATCCCAGAGCTCAACTCTGGAACTGCCTTTGATACTGGAAGTCTTGAGTATGGTAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGACCATTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG 100_141493 100_7 M00620:36:000000000-A41B2:1:1101:16768:2492 1:N:0:100 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0 TACGGAGGGTGCAAGCGTTAATCGGAGTTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG

On Aug 13, 2015, at 6:13 PM, Robert Leach wrote:

[kyasuda@hutlab3 demuxed_fasta_files]$ ls
xmd01m_seqs.fna xmd03s1_seqs.fna_usearch xmd06m_seqs.fna xmd14s1_seqs.fna_usearch
xmd01m_seqs.fna.otus.txt xmd03s1_seqs.fna_usearch.log xmd06m_seqs.fna.otus.txt xmd14s1_seqs.fna_usearch.log
xmd01m_seqs.fna_usearch xmd03s2_seqs.fna xmd06m_seqs.fna_usearch xmd14s2_seqs.fna
xmd01m_seqs.fna_usearch.log xmd03s2_seqs.fna.otus.txt xmd06m_seqs.fna_usearch.log xmd14s2_seqs.fna.otus.txt
xmd01s1_seqs.fna xmd03s2_seqs.fna_usearch xmd06s1_seqs.fna xmd14s2_seqs.fna_usearch
xmd01s1_seqs.fna.otus.txt xmd03s2_seqs.fna_usearch.log xmd06s1_seqs.fna.otus.txt xmd14s2_seqs.fna_usearch.log
xmd01s1_seqs.fna_usearch xmd04m_seqs.fna xmd06s1_seqs.fna_usearch xmd21m_seqs.fna
xmd01s1_seqs.fna_usearch.log xmd04m_seqs.fna.otus.txt xmd06s1_seqs.fna_usearch.log xmd21m_seqs.fna.otus.txt
xmd01s2_seqs.fna xmd04m_seqs.fna_usearch xmd06s2_seqs.fna xmd21m_seqs.fna_usearch
xmd01s2_seqs.fna.otus.txt xmd04m_seqs.fna_usearch.log xmd06s2_seqs.fna.otus.txt xmd21m_seqs.fna_usearch.log
xmd01s2_seqs.fna_usearch xmd04s1_seqs.fna xmd06s2_seqs.fna_usearch xmd21s1_seqs.fna
xmd01s2_seqs.fna_usearch.log xmd04s1_seqs.fna.otus.txt xmd06s2_seqs.fna_usearch.log xmd21s1_seqs.fna.otus.txt
xmd02m_seqs.fna xmd04s1_seqs.fna_usearch xmd07m_seqs.fna xmd21s1_seqs.fna_usearch
xmd02m_seqs.fna.otus.txt xmd04s1_seqs.fna_usearch.log xmd07m_seqs.fna.otus.txt xmd21s1_seqs.fna_usearch.log
xmd02m_seqs.fna_usearch xmd04s2_seqs.fna xmd07m_seqs.fna_usearch xmd21s2_seqs.fna
xmd02m_seqs.fna_usearch.log xmd04s2_seqs.fna.otus.txt xmd07m_seqs.fna_usearch.log xmd21s2_seqs.fna.otus.txt
xmd02s1_seqs.fna xmd04s2_seqs.fna_usearch xmd07s1_seqs.fna xmd21s2_seqs.fna_usearch
xmd02s1_seqs.fna.otus.txt xmd04s2_seqs.fna_usearch.log xmd07s1_seqs.fna.otus.txt xmd21s2_seqs.fna_usearch.log
xmd02s1_seqs.fna_usearch xmd05m_seqs.fna xmd07s1_seqs.fna_usearch xmd28m_seqs.fna
xmd02s1_seqs.fna_usearch.log xmd05m_seqs.fna.otus.txt xmd07s1_seqs.fna_usearch.log xmd28m_seqs.fna.otus.txt
xmd02s2_seqs.fna xmd05m_seqs.fna_usearch xmd07s2_seqs.fna xmd28m_seqs.fna_usearch
xmd02s2_seqs.fna.otus.txt xmd05m_seqs.fna_usearch.log xmd07s2_seqs.fna.otus.txt xmd28m_seqs.fna_usearch.log
xmd02s2_seqs.fna_usearch xmd05s1_seqs.fna xmd07s2_seqs.fna_usearch xmd28s1_seqs.fna
xmd02s2_seqs.fna_usearch.log xmd05s1_seqs.fna.otus.txt xmd07s2_seqs.fna_usearch.log xmd28s1_seqs.fna.otus.txt
xmd03m_seqs.fna xmd05s1_seqs.fna_usearch xmd14m_seqs.fna xmd28s1_seqs.fna_usearch
xmd03m_seqs.fna.otus.txt xmd05s1_seqs.fna_usearch.log xmd14m_seqs.fna.otus.txt xmd28s1_seqs.fna_usearch.log
xmd03m_seqs.fna_usearch xmd05s2_seqs.fna xmd14m_seqs.fna_usearch xmd28s2_seqs.fna
xmd03m_seqs.fna_usearch.log xmd05s2_seqs.fna.otus.txt xmd14m_seqs.fna_usearch.log xmd28s2_seqs.fna.otus.txt
xmd03s1_seqs.fna xmd05s2_seqs.fna_usearch xmd14s1_seqs.fna xmd28s2_seqs.fna_usearch
xmd03s1_seqs.fna.otus.txt xmd05s2_seqs.fna_usearch.log xmd14s1_seqs.fna.otus.txt xmd28s2_seqs.fna_usearch.log
[kyasuda@hutlab3 demuxed_fasta_files]$ pwd
/n/hutlab12_nobackup/data/saliva/input/demuxed_fasta_files
[kyasuda@hutlab3 demuxed_fasta_files]$ /n/huttenhower_lab_nobackup/tools/CFF/SOURCE_THIS
-bash: /n/huttenhower_lab_nobackup/tools/CFF/SOURCE_THIS: Permission denied
[kyasuda@hutlab3 demuxed_fasta_files]$ source /n/huttenhower_lab_nobackup/tools/CFF/SOURCE_THIS
[kyasuda@hutlab3 demuxed_fasta_files]$ run_CFF_on_FastA.tcsh 250 outputdir “xmd*_seqs.fna”

## RUNNING run_CFF_on_FastA.tcsh

Start time: Thu Dec 17 08:53:37 EST 2015
Trim length: 250
Z-score threshold: 2
Magnitude over N0 Threshold: 10
Nominations threshold: 2
OUTPUT DIRECTORY: outputdir
mergeSeqs.pl '“xmd*_seqs.fna”' -f 'global_library.fna' --outdir 'outputdir/2_lib' -o .lib -b 250 -p ''
ERROR1: Unable to open input file: [“xmd*_seqs.fna”]. No such file or directory
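
In the session above, invoking SOURCE_THIS directly fails with "Permission denied", while the same file loads when passed to source. That is consistent with general shell behaviour: running a file as a command needs the execute permission, whereas sourcing only needs read access because the current shell reads the file itself. A minimal sketch of that difference, assuming a POSIX-like shell; the temporary file and the variable DEMO_VAR are hypothetical stand-ins and say nothing about what SOURCE_THIS actually contains.

```shell
# Hypothetical stand-in for a settings file: readable, but not executable.
settings=$(mktemp)
echo 'DEMO_VAR=loaded' > "$settings"
chmod 644 "$settings"

# Running it as a command needs the execute bit, so this attempt is refused.
if "$settings" 2>/dev/null; then
  direct=ok
else
  direct=denied
fi

# Sourcing (".") only reads the file and evaluates it in the current shell.
. "$settings"

echo "direct execution: $direct"
echo "after sourcing:   DEMO_VAR=$DEMO_VAR"

rm -f "$settings"
```

The "Permission denied" line on its own therefore does not have to indicate a broken installation, as long as the subsequent source command completes without errors.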