qichao1984 / NCyc


Three potential inaccurate issues to be confirmed #7

Open liangjinsong opened 5 years ago

liangjinsong commented 5 years ago

Hi! While using NCycDB, I found some potential inaccuracies in it. I would like to confirm them with the author here so that this useful database can be further improved.

  1. There are 441 duplicated IDs (and sequences) in the file NCyc_100.faa. For your information, ten of the duplicated IDs are listed below:

     1000565.METUNv1_00600
     100226.SCO5525
     1002339.HMPREF9373_0600
     1002339.HMPREF9373_0712
     1003195.SCAT_4357
     1004836.SXCC_03526
     1006006.Mcup_0918
     1006551.KOX_23300
     1006551.KOX_27050
     1026882.MAMP_00572

  2. There are 81 duplicated IDs in the file id2gene.map, and what's more, the gene assigned to each of these IDs is not the same across its entries. The entries for eight of these IDs are listed below for your information:

     A0A060I505  napA
     A0A060I505  nasA
     A0A0D6FVV7  narG
     A0A0D6FVV7  narZ
     A0A127KYU8  nasB
     A0A127KYU8  nirB
     A0A142WVR2  narB
     A0A142WVR2  nasA
     A0A142Y6U0  narB
     A0A142Y6U0  nasA
     A0A142YH98  narB
     A0A142YH98  nasA
     A0A145CM87  narB
     A0A145CM87  nasA
     A0A1D8AZ75  narB
     A0A1D8AZ75  nasA

  3. The ID lists of the two files NCyc_100.faa and id2gene.map should be identical. However, 54410 IDs in NCyc_100.faa cannot be found in id2gene.map, and 99288 IDs in id2gene.map cannot be found in NCyc_100.faa. This will cause inaccuracy when calculating the relative abundance of nitrogen cycling gene (sub)families.
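For anyone who wants to reproduce these three checks, here is a minimal Python sketch. It is not part of NCyc. The function names are hypothetical, and it assumes that a FASTA ID is the first whitespace-delimited token after `>` and that id2gene.map is tab-delimited with the ID in the first column and the gene in the second.

```python
from collections import Counter, defaultdict


def fasta_ids(lines):
    """Return the ID (first token after '>') of every FASTA header line."""
    return [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    ]


def duplicated(ids):
    """Return the sorted IDs that occur more than once."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


def conflicting_map_ids(lines):
    """Return {id: sorted genes} for IDs that are mapped to more than one gene."""
    genes = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            genes[fields[0]].add(fields[1])
    return {i: sorted(g) for i, g in genes.items() if len(g) > 1}


def id_differences(faa_ids, map_ids):
    """Return (IDs only in the FASTA file, IDs only in the map file) as sets."""
    faa, mapped = set(faa_ids), set(map_ids)
    return faa - mapped, mapped - faa
```

Each function accepts any iterable of lines, so an open file handle (for example `open("NCyc_100.faa")`) can be passed directly. The counts it reports will depend on which release of the database files is checked.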

ajray34 commented 5 years ago

I also experienced issues with IDs not present in id2gene.map.

qichao1984 commented 5 years ago

Dear users, thanks for pointing out potential issues in this database! I will look into issues #1 and #2 pointed out by Liang ASAP. Issue #3 is actually intended, in order to reduce false positive assignments in NCyc profiling: the IDs that do not show up in id2gene.map belong to NCyc gene homologs, not NCyc gene families. As a result, reads that have better "blast" hit scores against NCyc homologs will not show up in the final NCyc profile.
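The filtering idea described above can be sketched roughly as follows. This is only an illustration in Python, not the actual NCyc script. The function name and the `best_hits` structure (read name mapped to the database ID of its best-scoring hit) are assumptions made for the example.

```python
from collections import Counter


def profile_from_best_hits(best_hits, id2gene):
    """Count reads per gene family, skipping reads whose best hit is not in id2gene.

    best_hits: dict mapping read name -> database ID of its best-scoring hit
    id2gene:   dict mapping database ID -> gene family (as in id2gene.map)
    """
    counts = Counter()
    for read, subject_id in best_hits.items():
        gene = id2gene.get(subject_id)
        # IDs missing from the map are treated as homologs, so the read is dropped.
        if gene is not None:
            counts[gene] += 1
    return counts
```

Under this scheme a read is only counted when its best hit is a sequence listed in id2gene.map, which is why the homolog sequences have to stay in the searched database even though they have no entry in the map.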



rprops commented 5 years ago

Any update on this issue? This could have implications for an analysis I'm currently running. Thank you.

qichao1984 commented 4 years ago

Yes, changes have been made to the database. However, nothing needs to be done for the script. Please feel free to use the updated database! Thanks for figuring out these issues!



From: rprops, sent Wednesday, October 30, 2019 3:55:46 PM:

"Just checking if there has been any update? I've noticed some database changes but none that were incorporated into the script."

rprops commented 4 years ago

OK, just checking, because the Perl script still has this chunk referring to the old id2gene.map file instead of id2gene.map.2019Jul:

my %id2gene;
open( FILE, "data/id2gene.map" ) || die "#1\n";
while (<FILE>) {
  chomp;
  my @items = split( "\t", $_ );
  $id2gene{ $items[0] } = $items[1];
}
close FILE;