qichao1984 / NCyc

42 stars 22 forks source link

Where did my single hit of amoA_B go when my samples concatenated? #19

Open MaryoHg opened 3 years ago

MaryoHg commented 3 years ago

Dear @qichao1984,

Thanks heaps for this DB.

We observed the following when analyzing our samples. We have 10 assemblies (*.fa files) with different numbers of contigs. I ran the NcyPerl.pl on my 10 assemblies (using the below command) and the obtained results contained a single hit with amoA_B in one sample (Figure 1).

$ NCycProfiler.pl -d ./ -m diamond -f fa -s nucl -si si.txt -rs 33175 -o out_file 2>&1 | tee log.run1

Figure 1: output file using all samples (n = 10) at the same depth (random sampling to 33,175) rarefying_samples

However, after running NCycProfiler.pl on all my assemblies concatenated (sum of all contigs, around half a million), that hit is gone (Fig. 2). I cat my *.fa and run the script as shown below.

$ cat $(ls *.fa) >> cat_assembly.fa $ NCycProfiler.pl -d ./ -m diamond -f fa -s nucl -si si.txt -o out_file 2>&1 | tee log.cat

Fig. 2. Output file using all samples (n = 10) concatenated (sum of all contigs) concatenated_assemblies

We appreciate your time and response to this. Greatings,

MaryoHg.

qichao1984 commented 3 years ago

This is quite weird. Did you use id2gene.map.2019Julhttps://github.com/qichao1984/NCyc/blob/master/data/id2gene.map.2019Jul instead of id2gene.maphttps://github.com/qichao1984/NCyc/blob/master/data/id2gene.map, and revise the script accordingly? Thanks!

发件人: Maryomailto:notifications@github.com 发送时间: Monday, March 1, 2021 11:56 PM 收件人: qichao1984/NCycmailto:NCyc@noreply.github.com 抄送: Qichao Tumailto:philloid@gmail.com; Mentionmailto:mention@noreply.github.com 主题: [qichao1984/NCyc] Where did my single hit of amoA_B go when my samples concatenated? (#19)

Dear @qichao1984https://github.com/qichao1984,

Thanks heaps for this DB.

We observed the following when analyzing our samples. We have 10 assemblies (*.fa files) with different numbers of contigs. I ran the NcyPerl.pl on my 10 assemblies (using the below command) and the obtained results contained a single hit with amoA_B in one sample (Figure 1).

$ NCycProfiler.pl -d ./ -m diamond -f fa -s nucl -si si.txt -rs 33175 -o out_file 2>&1 | tee log.run1

Figure 1: output file using all samples (n = 10) at the same depth (random sampling to 33,175) [rarefying_samples]https://user-images.githubusercontent.com/39426178/109520793-38a30400-7a72-11eb-9481-e43fdbf3e19a.png

However, after running NCycProfiler.pl on all my assemblies concatenated (sum of all contigs, around half a million), that hit is gone (Fig. 2). I cat my *.fa and run the script as shown below.

$ cat $(ls *.fa) >> cat_assembly.fa $ NCycProfiler.pl -d ./ -m diamond -f fa -s nucl -si si.txt -o out_file 2>&1 | tee log.cat

Fig. 2. Output file using all samples (n = 10) concatenated (sum of all contigs) [concatenated_assemblies]https://user-images.githubusercontent.com/39426178/109521431-ec0bf880-7a72-11eb-857f-04853e142a78.png

We appreciate your time and response to this. Greatings,

MaryoHg.

― You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/qichao1984/NCyc/issues/19, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABNORGAUQ3A55A4NQL6TXM3TBO2JNANCNFSM4YMZLSBA.

MaryoHg commented 3 years ago

Thanks for the prompt reply @qichao1984.

I found that the problem was related to duplicated "fasta header names". When I concat the assemblies, many contigs showed duplicated header names. I changed them – using: $ awk '/^>/{print ">contig" ++i; next}{print}' < input.contigs.fa > output.contigs.newheaders.fa–, and then I ran the NCycProfiler.pl script on it.

The output showed the hit with amoA_B observed when random-sampling. However, other genes change their values. For instance, NR had a value of 219 (See Fig. 2 in my first post), now NR has 300 (Fig. 3, below). My guess is that not only amoA_B gene was being affected by duplicated fasta headers names. Am I right?

Fig. 3. NCycProfiler.pl script ran on the "new-header-names"-concat assembly (n=10) using same parameters.

SC-newheader-cat

Do you know how can we get the numbers of headers that changed (or increased in number) per gene? Any idea would be greatly appreciated.

Thanks heaps. Specially thanks for your time.

MaryoHg.

qichao1984 commented 3 years ago

It is normal that you get different number of hits in this case, because the number of random sampling effort is different. Usually, the higher value random sampling is, the more hits you will get. Regarding the issue of contig heads, I suggest to run gene prediction first before running NCyc. Otherwise, you may get inaccurate profiles even you manually fix the seq ID issue.


From: Maryo notifications@github.com Sent: Wednesday, March 3, 2021, 01:15 To: qichao1984/NCyc Cc: Qichao Tu; Mention Subject: Re: [qichao1984/NCyc] Where did my single hit of amoA_B go when my samples concatenated? (#19)

Thanks for the prompt reply @qichao1984https://github.com/qichao1984.

I found that the problem was related to duplicated "fasta header names". When I concat the assemblies, many contigs showed duplicated header names. I changed them – using: $ awk '/^>/{print ">contig" ++i; next}{print}' < input.contigs.fa > output.contigs.newheaders.fa–, and then I ran the NCycProfiler.pl script on it.

The output showed the hit with amoA_B observed when random-sampling. However, other genes change their values. For instance, NR had a value of 219 (See Fig. 2 in my first post), now NR has 300 (Fig. 3, below). My guess is that not only amoA_B gene was being affected by duplicated fasta headers names. Am I right?

Fig. 3. NCycProfiler.pl script ran on the "new-header-names"-concat assembly (n=10) using same parameters. [SC-newheader-cat]https://user-images.githubusercontent.com/39426178/109686454-e8967100-7b47-11eb-90dc-dae02aee7d12.png

Do you know how can we get the numbers of headers that changed (or increased in number) per gene? Any idea would be greatly appreciated.

Thanks heaps. Specially thanks for your time.

MaryoHg.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/qichao1984/NCyc/issues/19#issuecomment-789068050, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABNORGFQSGGU7PZ5LNJUH33TBUMJ5ANCNFSM4YMZLSBA.

MaryoHg commented 3 years ago

@qichao1984

Thank you very much for the advice.

We usually use Prokka for ORF findings. Following your suggestion, Should I use the *.faa file from the Prokka's output, and then use that file to obtain the NCyc profile?

.faa | Protein FASTA file of the translated CDS sequences. (Source)

Thanks.

qichao1984 commented 3 years ago

Yes, Prokka's output faa file should be fine.


From: Maryo notifications@github.com Sent: Wednesday, March 3, 2021, 08:14 To: qichao1984/NCyc Cc: Qichao Tu; Mention Subject: Re: [qichao1984/NCyc] Where did my single hit of amoA_B go when my samples concatenated? (#19)

@qichao1984https://github.com/qichao1984

Thank you very much for the advice.

We usually use Prokka for ORF findings. Following your suggestion, Should I use the *.faa file from the Prokka's output, and then use that file to obtain the NCyc profile?

.faa | Protein FASTA file of the translated CDS sequences. (Sourcehttps://github.com/tseemann/prokka#output-files)

Thanks.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/qichao1984/NCyc/issues/19#issuecomment-789317905, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABNORGHFB7ASKSQOSQTG2UTTBV5PFANCNFSM4YMZLSBA.

MaryoHg commented 3 years ago

Dear @qichao1984,

Thanks for your advice and for your time. I will do and report our analysis based on your suggestions. Stay healthy.

Regards, MaryoHg.