jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Formatting external database #717

Closed mradz19 closed 1 year ago

mradz19 commented 1 year ago

I am attempting to use the non-redundant protein database from RefSeq as an external database, but I am having some issues. I have formatted my reference FASTA file like this:

>WP_241403706.1|heme-binding protein
MFVSGRSLHSAAIGAICAGGMLFGGAAVASAEPPPPPPPNCTAGDLAVASGTVGTAMGAYLFSHPDVNNFFTSLRGLPHEEVRGRVQTYMDANPQVETEINGIRQPLTDVRNRCDVPEPLGS
>WP_241403737.1|3-hydroxyacyl-CoA dehydrogenase
MQIKDAVAVVTGGASGLGLATTKRLLDAGASVVVIDLKGEEVVAELGDRAKFVATNVTDEDGVSKALDVAESLGPLRINVNCAGIGNAIKTLGKDGPFPLDGFKKVVEVNLIGTFNVLRLAAERIAKNEPLGEERGVIINTASVAAFEGQIGQAAYSASKGGVVGMTLPIARDLSRSLIRVCTIAPGLFKTPLLGSLPEEAQKSLGQQVPHPARLGDPDEYGALAVHIVENAMLNGEVIRLDGAIRMAPR
>WP_241403735.1|enoyl-CoA hydratase
MSTTESYTGIDDLTVSLSDGVLSMTLNRPDSLNSLTAAMLSTITSTMERAGDDPAVRVVRLGGAGRGFSSGAGIGAEDRANPGASGAPGDVLEAANRAVSAIISSPKPVVSVIQGPTAGVGVSLAIAADIILASETAYFLLAFTKIGLMPDGGASALVAASIGRTRAMRMALLAERLSAADALTAGLVSAVYPADDLDAGVDAVLARLKSGPAVALRKTKHAINAATLTELDAAFGRETEGQMTLLTAKDFHEGAMAFQERRAPTFTDD
>WP_241403748.1|PPOX class F420-dependent oxidoreductase
MTRHVLDDKLLAVISGNSLGILATIKRDGRPQLSNVSYHFDSRNLAIQVSVREPLAKTRNLRRDPRASVHVPSDDGWAYAVAEGDAILTAPAAAPDDDTVEALIALYRNIAGEHPDWDDYRRAMVDDRRVLLTLPISHLYGLPPGIR

However, when I do the run, the output following DIAMOND in step 7 (07.test_run.fun3.Refseq) looks like this:

# Created by /data/anaconda3/envs/SqueezeMeta_new/SqueezeMeta/scripts/07.fun3assign.pl for Refseq, Wed Jul 26 04:43:39 2023, evalue=, miniden=, minolap=30
#ORF    BESTHIT BESTAVER
megahit_1_1-132 helix-turn-helix        helix-turn-helix
megahit_1_226-2358  PBP1A   PBP1A
megahit_1_2521-3732     tyrosine--tRNA  tyrosine--tRNA
megahit_1_3766-5943     hypothetical    hypothetical
megahit_1_6019-6921     glucose-1-phosphate     glucose-1-phosphate
megahit_1_6923-7822     dTDP-4-dehydrorhamnose  dTDP-4-dehydrorhamnose
megahit_1_7824-8891     dTDP-glucose    dTDP-glucose
megahit_1_8891-9736     LicD    LicD
megahit_1_9949-11739    NTP     NTP
megahit_1_11758-12705   EamA

I can see that words are duplicated, and in some cases the functional description has been reduced to a few letters. Is this because of the spaces in the functional descriptions of the FASTA file?

jtamames commented 1 year ago

Hello,

Yes, the shortening of the names is caused by the white spaces. DIAMOND formats its databases that way, treating a blank space as the end of the header. So you had better replace the white spaces with underscores.

Regarding the annotations, they are not duplicated. The second column is the classification based on the best hit, and the third is the classification based on the best average. That is why there is sometimes a best hit but no best average. Please refer to the manual for details on how this works.

Best, J
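For anyone hitting the same problem, the header fix suggested above can be sketched in a few lines of Python. This is only an illustration, not part of SqueezeMeta: the function names `sanitize_header` and `sanitize_fasta` are made up for this example, and it assumes a plain-text FASTA file in which header lines start with `>`.

```python
import re


def sanitize_header(line: str) -> str:
    """Replace each run of whitespace in a FASTA header with one underscore."""
    return re.sub(r"\s+", "_", line.strip())


def sanitize_fasta(lines):
    """Yield FASTA lines, rewriting header lines only; sequences are untouched."""
    for line in lines:
        line = line.rstrip("\n")
        yield sanitize_header(line) if line.startswith(">") else line


# Small in-memory example using one header from the question above.
example = [
    ">WP_241403737.1|3-hydroxyacyl-CoA dehydrogenase",
    "MQIKDAVAVVTGG",
]
for out in sanitize_fasta(example):
    print(out)
# prints:
# >WP_241403737.1|3-hydroxyacyl-CoA_dehydrogenase
# MQIKDAVAVVTGG
```

To process a real file, open the input and output files and write each yielded line followed by a newline. The external database then has to be rebuilt from the rewritten FASTA so that the full descriptions end up in the DIAMOND database.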