some conflicts and still can not run

marieBvr commented 3 years ago

poursalavati commented 3 years ago

Thank you, dear Marie, for your help and update.

I tried the slurm version, but unfortunately it still has some problems.

For example, in loadTaxonomy.pl, at line 272, there is unnecessary whitespace before EOF. After removing it, the script runs, but it again creates only an 80 KB SQL database. I also have enough free space on our HPC.
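In case someone else hits the same error, here is a quick way to look for indented terminators. This is only a sketch: it assumes the heredoc terminator is literally EOF, as it is at line 272, and find_indented_eof is a helper name made up for illustration, not part of virAnnot.

```shell
# find_indented_eof is a hypothetical helper, not part of virAnnot.
# It prints the line number and content of every line that is just "EOF"
# preceded by whitespace. A plain Perl heredoc (<<EOF) ends only at a line
# containing exactly the terminator, so an indented "EOF" is not recognized.
find_indented_eof() {
    grep -nE '^[[:space:]]+EOF[[:space:]]*$' "$1"
}
# Usage: find_indented_eof loadTaxonomy.pl
```

No output means no indented EOF line was found.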

This is the output of the run:

>virAnnot/slurm/db$ ./loadTaxonomy.pl -struct taxonomyStructure.sql -index taxonomyIndex.sql -acc_prot acc2taxid.prot -acc_nucl acc2taxid.nucl -names names.dmp -nodes nodes.dmp -gi_prot gi_taxid_prot.dmp -acc_wgs acc2taxid.nucl -dead_prot dead_prot.accession2taxid -dead_nucl dead_nucl.accession2taxid
2021/07/12 21:36:03  INFO> loadTaxonomy.pl:122 main::_create_sqlite_db - Creating database.
2021/07/12 21:36:05  INFO> loadTaxonomy.pl:78 main::_insertingCSVDataInDatabase - Inserting tables into database...
2021/07/12 21:36:05  INFO> loadTaxonomy.pl:80 main::_insertingCSVDataInDatabase - prot_accession2taxid
2021/07/12 21:36:05  INFO> loadTaxonomy.pl:80 main::_insertingCSVDataInDatabase - nucl_accession2taxid
2021/07/12 21:36:05  INFO> loadTaxonomy.pl:80 main::_insertingCSVDataInDatabase - gi_prot
2021/07/12 21:36:05  INFO> loadTaxonomy.pl:80 main::_insertingCSVDataInDatabase - names
2021/07/12 21:36:05  INFO> loadTaxonomy.pl:80 main::_insertingCSVDataInDatabase - nodes

What do you suggest? How can we import these data into the SQL database and move on to the next step?

marieBvr commented 3 years ago

Hi Naser,

Sorry for my late reply. I have been trying to reproduce your issue, but without success. I found out that the ftp://ftp.ncbi.nih.gov/pub/taxonomy/obsolete/gi_taxid_prot.dmp.gz link is no longer available because the file is deprecated.

I will try to update the ./loadTaxonomy.pl script to work with the new NCBI files. This may take some time; I apologize for the inconvenience.

Sincerely yours, Marie

poursalavati commented 3 years ago

Thank you very much for redeveloping this code.

Yes, as you mentioned, NCBI has changed the structure of the database and some of its files.

Recently, for another tool, I needed the gi_taxid_nucl.dmp.gz and gi_taxid_prot.dmp.gz files, and they are no longer available.

This is how I rebuilt these files from the existing NCBI accession2taxid files (it may need to be added to the loadTaxonomy.pl script, or run as a separate script before loadTaxonomy.pl).

To rebuild gi_taxid_nucl.dmp.gz from acc2taxid.nucl (or from the other accession2taxid files at https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/):

awk '{ print $4 " " $3}' acc2taxid.nucl > gi_taxid_nucl_temp1.dmp

tail -n +2 gi_taxid_nucl_temp1.dmp > gi_taxid_nucl_temp2.dmp

rm gi_taxid_nucl_temp1.dmp

tr ' ' \\t < gi_taxid_nucl_temp2.dmp > gi_taxid_nucl_new.dmp

rm gi_taxid_nucl_temp2.dmp
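
The same steps can probably be collapsed into a single awk call with no temporary files. This is a sketch, assuming the accession2taxid file has one header line and the columns accession, accession.version, taxid, gi; the function name acc2gi_taxid is only for illustration.

```shell
# acc2gi_taxid is a hypothetical helper name.
# It skips the header line (NR > 1) and prints gi<TAB>taxid for each record,
# which is what the awk / tail / tr sequence produces.
acc2gi_taxid() {
    awk 'NR > 1 { print $4 "\t" $3 }' "$1"
}
# Usage: acc2gi_taxid acc2taxid.nucl > gi_taxid_nucl_new.dmp
```

Because the header is dropped and the tab is written inside awk, the tail and tr steps are no longer needed.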

And to rebuild gi_taxid_prot.dmp.gz from acc2taxid.prot (or from the other accession2taxid files at https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/):

awk '{ print $4 " " $3}' acc2taxid.prot > gi_taxid_prot_temp1.dmp

tail -n +2 gi_taxid_prot_temp1.dmp > gi_taxid_prot_temp2.dmp

rm gi_taxid_prot_temp1.dmp

tr ' ' \\t < gi_taxid_prot_temp2.dmp > gi_taxid_prot_new.dmp

rm gi_taxid_prot_temp2.dmp
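
Before passing a rebuilt file to loadTaxonomy.pl, it may be worth checking that every line really is gi<TAB>taxid with both fields numeric. This is a rough sketch: check_gi_taxid is a made-up helper name, and the two-numeric-column layout is my assumption about what the script expects from the old gi_taxid dumps.

```shell
# check_gi_taxid is a hypothetical helper name.
# It reports every line that is not exactly two tab-separated numeric fields
# and returns a non-zero exit status if any such line was found.
check_gi_taxid() {
    awk -F'\t' '
        NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ {
            printf "line %d looks wrong: %s\n", NR, $0
            bad++
        }
        END { exit (bad > 0) }
    ' "$1"
}
# Usage: check_gi_taxid gi_taxid_prot_new.dmp && echo "format looks fine"
```

A leftover header line or a space-separated line would be reported here instead of silently ending up in the database.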

Sincerely yours,

Naser

marieBvr / virAnnot

some conflicts and still can not run #1