When I extract FASTA from highquality_clust30, I get the following headers:
>ESMFOLD V0 PREDICTION FOR MGYP000138429313
>ESMFOLD V0 PREDICTION FOR MGYP001595280761
...
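I use FoldComp for a downstream application. FASTA parsers conventionally take the first whitespace-delimited token of the header as the sequence identifier, so here every sequence gets the identifier ESMFOLD, which is not unique. The unique ID (e.g. MGYP000138429313) ends up in the comment part of the header instead.
I can run sed on the output to fix this, but that feels hacky.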
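For reference, a minimal sketch of that kind of header rewrite, in Python rather than sed. It assumes every header contains an accession of the form MGYP followed by digits, as in the two headers shown above. The function names `rewrite_header` and `rewrite_fasta` are made up for illustration and are not part of FoldComp.

```python
import re

# Assumption: the unique ID is an "MGYP" prefix followed by digits,
# as in ">ESMFOLD V0 PREDICTION FOR MGYP000138429313".
ACCESSION = re.compile(r"MGYP\d+")


def rewrite_header(line: str) -> str:
    """Put the MGYP accession first so parsers pick it up as the sequence ID."""
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    match = ACCESSION.search(line)
    if match is None:
        return line  # headers without an accession are left untouched
    title = line[1:].strip()
    return f">{match.group(0)} {title}\n"


def rewrite_fasta(in_path: str, out_path: str) -> None:
    """Stream a FASTA file, rewriting header lines only (hypothetical helper)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(rewrite_header(line))
```

The original title is kept after the accession, so nothing is lost; only the first token changes.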
The highquality_clust30.lookup looks appropriate:
0 MGYP002174220927 0
1 MGYP000064029927 0
Do you have recommendations on how to get proper FASTA headers?
Sorry for the late response. In 412c7a8 I changed the default to use the id/filename as the header when extracting sequences, and introduced a use-title flag for cases where the title is needed.