qichao1984 / NCyc

Format issue with -si file #15

Closed bgregs94 closed 3 years ago

bgregs94 commented 3 years ago

Hi,

I have a similar problem to issue #11. I am running NCyc on two samples as a test. These files are called sample31merged.fastq.gz (127189806 reads) and sample32merged.fastq.gz (191219006 reads). I run the following command:

perl NCycProfiler.PL -d ~/Dorset_mesocosms/ProjectLIMS19618/CLEAN_READS2/MERGED_READS/ncyctest/ -m diamond -f fastq.gz -s nucl -si ~/Dorset_mesocosms/ProjectLIMS19618/CLEAN_READS2/MERGED_READS/ncyctest/samples.tsv -o ~/nitro_genes.tsv

After running I get the following:

Deallocating buffers... [0.043s]
Deallocating queries... [0.004s]
Loading query sequences... [0s]
Closing the input file... [0.018s]
Closing the output file... [0.132s]
Closing the database file... [0.01s]
Deallocating taxonomy... [0s]
Total time = 5829.35s
Reported 3850528 pairwise alignments, 3850528 HSPs.
3850528 queries aligned.
The host system is detected to have 1621 GB of RAM. It is recommended to increase the block size for better performance using these parameters : -b12 -c1
 was not found in /home/bgregs/Dorset_mesocosms/ProjectLIMS19618/CLEAN_READS2/MERGED_READS/ncyctest/samples.tsv, please check!

Diamond appears to run successfully, generating sample31merged.diamond and sample32merged.diamond files, but then I get the above error and there is no further output. This seems to be an issue with reading the sample information file. I've tried a few different formats for the sample info file: with the fastq.gz extension (samples.tsv), without it (samples2.tsv) as recommended in issue #14, and without the full file path (sampleinfo.tsv), but I get the same error regardless. I have attached these files as .txt files as I can't upload tsv to GitHub:

sampleinfo.txt samples2.txt samples.txt

The only change I have made to the script is the path to diamond:

    #Please specify where your prefered database searching tool locates
    #NCBI blast ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/blast-2.2.26-x64-linux.tar.gz
    my $blast = "~/bin/blast/bin/blastall";
    my $formatdb = "~/bin/blast/bin/formatdb";
    #diamond https://github.com/bbuchfink/diamond/releases
    my $diamond = "/usr/local/bin/diamond2";
    #usearch https://www.drive5.com/usearch/download.html
    my $usearch = "~/bin/usearch8.1.1861_i86linux32";

Any advice to sort out this issue would be appreciated.

qichao1984 commented 3 years ago

Hi, the correct format should be something like “sample31merged 127189806“ (sample name, then read count). However, as you can see, your sampleinfo.txt contains Windows end-of-line characters (“^M”). The file should be converted to Linux format.
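
A minimal shell sketch of how such line endings could be detected and removed, assuming only standard `grep`, `tr`, `head`, and `cut`. The temporary file and its contents are made up for the demo, and the tab separator is an assumption based on the `split( "\t", $_ )` call quoted later in this thread. On Linux, Perl's `chomp` removes only the trailing newline, so a carriage return would stay attached to the last field of each line.

```shell
# Hypothetical demo file: two tab-separated columns (sample name, read count)
# written with Windows (CRLF) line endings.
tmpdir=$(mktemp -d)
info="$tmpdir/sampleinfo.txt"
printf 'sample31merged\t127189806\r\nsample32merged\t191219006\r\n' > "$info"

# Count the lines that contain a carriage return (displayed as ^M by `cat -v`).
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" "$info")
echo "lines with CR before conversion: $crlf_lines"

# Strip every carriage return; `dos2unix` does the same job where installed.
tr -d '\r' < "$info" > "$info.unix"

# grep -c prints 0 but exits non-zero when nothing matches, hence `|| true`.
remaining=$(grep -c "$cr" "$info.unix" || true)
echo "lines with CR after conversion: $remaining"

# Second column of the first line, now without a trailing carriage return.
first_count=$(head -n 1 "$info.unix" | cut -f 2)
echo "first read count: $first_count"
```

Writing to a new file rather than converting in place keeps the original around for comparison.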

bgregs94 commented 3 years ago

Hi,

I am still getting the same error as above. I converted sampleinfo.txt with:

dos2unix sampleinfo.txt

I also checked the format with:

od -c sampleinfo.txt
cat -e sampleinfo.txt

I have attached the sample info file I used. Is there still an issue with the format, or is there another error in it?

sampleinfo-2.txt

qichao1984 commented 3 years ago

Is there any file named “.diamond” (not *.diamond) in your directory? If so, delete it.
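
A small shell sketch of that check, since a plain `ls` hides dot-files and a file named exactly `.diamond` is easy to miss. The `workdir` folder is a hypothetical stand-in for the directory passed to `-d`, and the files in it are created only for the demo.

```shell
# Hypothetical working directory standing in for the folder passed to -d.
workdir=$(mktemp -d)
touch "$workdir/sample31merged.diamond" "$workdir/.diamond"

# Test for the exact name instead of relying on a directory listing.
if [ -e "$workdir/.diamond" ]; then
    stray="yes"
    rm -- "$workdir/.diamond"
else
    stray="no"
fi
echo "stray .diamond file found: $stray"

# List what is left, hidden files included, to confirm the removal.
left=$(ls -A "$workdir")
echo "$left"
```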

qichao1984 commented 3 years ago

Alternatively, put your fastq.gz files in the same directory as NCycProfiler.PL. It could be that NCycProfiler.PL failed to find the *.diamond files.

bgregs94 commented 3 years ago

Hi, I am still getting the same error. I've made sure there are no diamond files in the directory before I run the command. I have also moved the fastq.gz files to the same directory as the NCycProfiler.PL script. I now run the following command with the reformatted txt file:

perl NCycProfiler.PL -d ~/ -m diamond -f fastq.gz -s nucl -si ~/sampleinfo.txt -o ~/nitro_genes.tsv

Is there anything else I can try?

yimingma1207 commented 3 years ago

Hi, I'm having the same problem. The command I run is as follows:

perl /home/mym/NCycDB/diamond/NCycProfiler_3.PL -d ~/ -m diamond -f fasta -s nucl -si /home/mym/NCycDB/diamond/DOC1-3.tsv -o try-sample.tsv

I put the .PL file and the sequences (.fasta) in the same folder, as described earlier in this thread. The .tsv file passed to -si also uses the Unix format, but the following error still occurs.

diamond v2.0.6.144 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

CPU threads: 32
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /home/mym/anaconda3/envs/ncyc/bin/formatdb/data/NCyc_100_2019Jul.faa
Opening the database file... [0.023s]
Loading sequences... [0.397s]
Masking sequences... [0.453s]
Writing sequences... [0.089s]
Hashing sequences... [0.031s]
Loading sequences... [0s]
Writing trailer... [0.001s]
Closing the input file... [0s]
Closing the database file... [3.07s]
Database hash = 3b585324bcb2da4bd3127ca50767ae79
Processed 273501 sequences, 101944919 letters.
Total time = 4.069s
 was not found in /home/mym/NCycDB/diamond/DOC1-3.tsv, please check!

The Perl code where the error is raised looks like this (lines 118-130 in the PL file):

    my %size;
    my @sizes;
    open( FILE, "$sampleinfo" ) || die "#3\n";
    # Read the sample info file: column 1 = sample name, column 2 = size
    while (<FILE>) {
        chomp;
        my @items = split( "\t", $_ );
        $size{ $items[0] } = $items[1];
        push( @sizes, $items[1] );
    }
    close FILE;
    # Every detected sample must have an entry in the sample info file
    foreach my $sample ( keys %samples ) {
        die "$sample was not found in $sampleinfo, please check!\n" if !$size{$sample};
    }
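
The same lookup can be approximated from the shell before starting a long run. This is only a rough sketch: it assumes that sample names are the `*.diamond` file names without the extension (the thread does not show how `%samples` is filled), and the temporary folder, file names, and read count below are invented for the demo.

```shell
# Hypothetical layout: a folder of *.diamond outputs plus a sample info file.
dir=$(mktemp -d)
touch "$dir/sample31merged.diamond" "$dir/sample32merged.diamond"
printf 'sample31merged\t127189806\n' > "$dir/sampleinfo.txt"

missing=""
for f in "$dir"/*.diamond; do
    # Assumption: the sample name is the file name without ".diamond".
    name=$(basename "$f" .diamond)
    # Require an exact match in column 1 together with a non-empty column 2.
    if ! awk -F '\t' -v n="$name" \
        '$1 == n && $2 != "" { found = 1 } END { exit !found }' \
        "$dir/sampleinfo.txt"; then
        missing="$missing $name"
    fi
done
echo "samples missing from the info file:$missing"
```

Here `sample32merged` is reported because the demo info file lists only one sample. An empty name in front of "was not found", as in the error above, would suggest that the script picked up a sample whose name is empty.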

Is there something I can do to make the program work? Thank you very much!

This is the sample info file I used: DOC1-3.txt

@qichao1984

qichao1984 commented 3 years ago

It seems that, due to Linux path issues, the script failed to find any diamond file in the current directory. To test, you may try:

cd /home/mym/NCycDB/diamond/
perl NCycProfiler_3.PL -d . -m diamond -f fasta -s nucl -si DOC1-3.tsv -o try-sample.tsv


yimingma1207 commented 3 years ago

Hi, Thank you very much for your help!!! The problem is solved and the program is running!