fasta file identifier - Githubissues

ncbi / amr

AMRFinderPlus - Identify AMR genes and point mutations, and virulence and stress resistance genes in assembled bacterial nucleotide and protein sequence.

https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/

fasta file identifier #136

Closed Fla1487 closed 8 months ago

Fla1487 commented 8 months ago

Dear all, I have noticed that AMRFinderPlus has a problem with .fasta files in which every contig is named with the same identifier (files derived from Ridom). An example is below:

>sequence ID #1
AAACCCGGG
>sequence ID #2
CCCGGAA

This problem does not occur when the .fasta file looks like this (derived from Unicycler):

>1
ACCCCGGGTT
>2
GGCCTTAAA

Do you have any suggestions?

Many thanks

vbrover commented 8 months ago

FASTA is a file like this:

>id1
AAACCCGGG
>id2
CCCGGAA

Each sequence must have a unique identifier in order to be uniquely identified.

https://en.wikipedia.org/wiki/FASTA_format:

The description line (defline) or header/identifier line, which begins with ">", gives a name and/or a unique identifier for the sequence,
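To check a file for this problem before running AMRFinderPlus, a short script can list identifiers that occur more than once. The sketch below is only an illustration and assumes the common convention that the identifier is the text after ">" up to the first whitespace; the function names `fasta_ids` and `duplicate_ids` are made up for this example and are not part of AMRFinderPlus. The sample headers are the ones from the original report.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield each record's identifier: text after '>' up to the first whitespace."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # A bare ">" line has no identifier at all; report it as "".
            yield fields[0] if fields else ""


def duplicate_ids(lines):
    """Return a dict of identifiers that occur more than once, with their counts."""
    counts = Counter(fasta_ids(lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Headers as in the original report (assumed to start with '>').
example = [
    ">sequence ID #1",
    "AAACCCGGG",
    ">sequence ID #2",
    "CCCGGAA",
]

print(duplicate_ids(example))  # {'sequence': 2}
```

Under that assumption both headers reduce to the identifier "sequence", which is why the two contigs cannot be told apart.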

What is Ridom?

evolarjun commented 8 months ago

Hi @Fla1487,

I'm not sure why Ridom is naming sequences like that; it seems highly unconventional. As Slava said above, AMRFinderPlus requires a unique identifier for each sequence so that it can report which contig each gene or point mutation came from, and where.

If you just want to get results, here's a perl one-liner to append a number to each identifier to make sure they're unique:

perl -pe 's/>(\w+)/">$1" . ++$i/e'  file.fasta > file.unique_ids.fasta

You could then run AMRFinderPlus on file.unique_ids.fasta.
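If Perl is not at hand, roughly the same renaming can be sketched in Python. This is not a line-for-line port: it treats the identifier as everything after ">" up to the first whitespace (the one-liner uses `\w+`), and it joins the counter with an underscore so the numbered names cannot run together (without a separator, `contig1` plus `11` and `contig11` plus `1` would both give `contig111`). The function name `make_ids_unique` is made up for this example.

```python
import re

# Identifier = everything after '>' up to the first whitespace (an assumption).
HEADER = re.compile(r"^>(\S*)")


def make_ids_unique(lines):
    """Append '_<n>' to the identifier of the n-th header line; pass other lines through."""
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            # Rewrite only the identifier; the rest of the header is kept as-is.
            line = HEADER.sub(lambda m: f">{m.group(1)}_{count}", line, count=1)
        yield line


example = [">sequence ID #1", "AAACCCGGG", ">sequence ID #2", "CCCGGAA"]
for line in make_ids_unique(example):
    print(line)
# >sequence_1 ID #1
# AAACCCGGG
# >sequence_2 ID #2
# CCCGGAA
```

For a real file, the same function can be fed an open file handle and its output written with `writelines`, since line endings are passed through unchanged.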

Hope that helps, Arjun

Fla1487 commented 8 months ago

Thank you for your reply. RidomSeqSphere is a GUI. In the past I used to analyze its .fasta files, but I have now noticed this problem, which is absent when I produce .fasta files with SPAdes/Unicycler on the command line.

I have noticed differences in the header identifiers, as well as an extra row between contigs and a different format for the sequence lines.

Thank you again

Fla1487 commented 8 months ago

Thank you Arjun, AMRFinderPlus now works without problems. I do not know why Ridom generates FASTA files in this format (with the previous version I had no problems).

Thank you for your help.