smangul1 / MiCoP

MiCoP is a method for high-accuracy profiling of viral and fungal metagenomic communities.
GNU General Public License v3.0

How to update the database #11

Open wfgui opened 2 years ago

wfgui commented 2 years ago

Hi, I want to update the database to get more accurate results. Can I update it?

Thanks.

magibc commented 2 years ago

Very interesting question, @wfgui. Is this possible, @smangul1?

Thanks in advance,

Magi.

dkoslicki commented 2 years ago

Since the method is predominantly BWA-based, you should be able to simply use bwa index to build a new reference database. In addition, you will need accession2info.txt files similar to the ones in this repository, which should be doable with the NCBI taxdump (or a similar resource if you're using a different taxonomy).
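The suggestion above can be sketched in Python as follows. This is only an illustration, not part of MiCoP: the helper names (concatenate_fasta, bwa_index_command, build_index) and the file names in the usage note are hypothetical, and it assumes the bwa executable is on your PATH. The only BWA call used is bwa index with its -p prefix option.

```python
import gzip
import subprocess


def concatenate_fasta(inputs, output):
    """Concatenate plain or gzipped FASTA files into one uncompressed file."""
    with open(output, "wb") as out:
        for path in sorted(map(str, inputs)):
            opener = gzip.open if path.endswith(".gz") else open
            last = b"\n"
            with opener(path, "rb") as fh:
                # Copy in 1 MiB chunks so large genomes are not held in memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            # Make sure the next file's header starts on its own line.
            if last != b"\n":
                out.write(b"\n")
    return output


def bwa_index_command(fasta, prefix=None):
    """Return the argument list for `bwa index`, optionally with -p PREFIX."""
    cmd = ["bwa", "index"]
    if prefix:
        cmd += ["-p", str(prefix)]
    cmd.append(str(fasta))
    return cmd


def build_index(fasta, prefix=None):
    """Run `bwa index`; raises CalledProcessError if bwa fails."""
    subprocess.run(bwa_index_command(fasta, prefix), check=True)
```

For example, concatenate_fasta(["viral.1.1.genomic.fna.gz", "viral.2.1.genomic.fna.gz"], "viral_all.fna") followed by build_index("viral_all.fna") would produce the index files next to viral_all.fna. How MiCoP is then pointed at the new index is not covered in this thread, so check the repository scripts for the expected file locations.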

magibc commented 2 years ago

Dear @dkoslicki

Thanks for your response. This will be my first time building a reference database.

I am writing again to ask whether you could give me some specific hints on doing this with the latest NCBI RefSeq release (release 213).

1) I understand that I have to retrieve all the .fna files from https://ftp.ncbi.nih.gov/refseq/release/viral/, whose sequences carry NC accessions like those in accession2info-viral.txt. By contrast, https://ftp.ncbi.nlm.nih.gov/genomes/refseq/ shows the assembly accession (starting with GCF) rather than the NC accession used in the other link. Once I have downloaded all the NCBI RefSeq viral files, I will concatenate the .fna FASTA files and then index the result with BWA. Is this workflow OK?

2) For the second part (creating accession2info.txt), I looked at the NCBI taxdump, which contains names.dmp and nodes.dmp (files I have used with Kraken2 for shotgun taxonomic classification), but I cannot see how to construct accession2info.txt from these .dmp files. Also, assembly_summary.txt lists only the accession number and species, not all the taxonomic ranks (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Aaosphaeria_arxii/assembly_summary.txt).

3) In accession2info-viral.txt, the first column is the NCBI accession (starting with NC) and the third column is the associated NCBI TaxID, but what is the value in the second column?

Thanks for your time and help. I would really appreciate any comments that help me update the database used by MiCoP.

nlapier2 commented 2 years ago

@wfgui Thanks for your interest in MiCoP.

1) I think this workflow should be fine.

2) The full taxonomic lineages should be recoverable from names.dmp and nodes.dmp. I can't remember the details, but if you unpack everything and browse the files, nodes.dmp links each taxid to its parent and rank and names.dmp holds the names, so together they give the complete lineage for any taxid.

3) The second column is the genomic length corresponding to that accession.
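As a rough sketch of answers 2) and 3), the Python below walks the parent links in nodes.dmp to build a lineage and measures sequence lengths from a FASTA file. The function names are hypothetical and this is not MiCoP's own code. It assumes the standard taxdump layout (fields separated by tab, pipe, tab; nodes.dmp starting with taxid, parent taxid, rank; names.dmp starting with taxid, name, unique name, name class). The taxdump does not map accessions to taxids; that mapping has to come from another source, for example NCBI's accession2taxid files. Only the first three columns of accession2info.txt are described in this thread, so compare any remaining columns against the existing files before writing your own.

```python
def split_dmp(line):
    """Split one taxdump row such as '9\\t|\\t1\\t|\\tspecies\\t|' into fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_nodes(path):
    """Return (parent, rank) dicts keyed by taxid string."""
    parent, rank = {}, {}
    with open(path) as fh:
        for line in fh:
            fields = split_dmp(line)
            parent[fields[0]] = fields[1]
            rank[fields[0]] = fields[2]
    return parent, rank


def load_names(path):
    """Return a dict mapping taxid to its scientific name."""
    names = {}
    with open(path) as fh:
        for line in fh:
            fields = split_dmp(line)
            if fields[3] == "scientific name":
                names[fields[0]] = fields[1]
    return names


def lineage(taxid, parent, rank, names):
    """Return [(rank, taxid, name), ...] from the root down to taxid."""
    path = []
    while taxid in parent:
        path.append((rank[taxid], taxid, names.get(taxid, "")))
        if parent[taxid] == taxid:  # the root node is its own parent
            break
        taxid = parent[taxid]
    return path[::-1]


def fasta_lengths(path):
    """Return a dict mapping each accession to its sequence length."""
    lengths = {}
    acc = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                acc = line[1:].split()[0]  # accession is the first header token
                lengths[acc] = 0
            elif acc is not None:
                lengths[acc] += len(line)
    return lengths
```

With parent, rank = load_nodes("nodes.dmp") and names = load_names("names.dmp"), calling lineage(taxid, parent, rank, names) gives the ranked lineage for one taxid, and fasta_lengths("viral_all.fna") gives the values for the second column. The file name viral_all.fna is only a placeholder for your concatenated FASTA.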