neherlab / pangraph

A bioinformatic toolkit to align genome assemblies into pangenome graphs
https://neherlab.github.io/pangraph
MIT License
77 stars 7 forks

check input fasta file for records with same name #47

Closed mmolari closed 1 year ago

mmolari commented 1 year ago

If the input fasta file contains records with the same name, pangraph does not fail, but the second of the duplicated records will not be present in the output pangraph. A simple preliminary check could be added to make sure that no two records share a name, and to make pangraph fail if they do.
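
To illustrate the kind of preliminary check meant here, below is a minimal Python sketch. It is not pangraph code. The function names `fasta_names`, `find_duplicate_names` and `check_unique_names` are hypothetical, and it assumes the record name is the first whitespace-delimited token of the header line, which may not match how pangraph parses names.

```python
from collections import Counter


def fasta_names(lines):
    """Yield the name of each FASTA record found in an iterable of lines.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            yield header.split()[0] if header else ""


def find_duplicate_names(lines):
    """Return a sorted list of the names that occur more than once."""
    counts = Counter(fasta_names(lines))
    return sorted(name for name, n in counts.items() if n > 1)


def check_unique_names(lines):
    """Raise ValueError if any record name is repeated, so the run fails early."""
    duplicates = find_duplicate_names(lines)
    if duplicates:
        raise ValueError("duplicate FASTA record names: " + ", ".join(duplicates))


# Example with an in-memory list of lines; an open file handle works the same way.
example = [">seq1 first copy\n", "ACGT\n", ">seq2\n", "GGCC\n", ">seq1\n", "TTAA\n"]
print(find_duplicate_names(example))  # prints ['seq1']
```

Running such a check before any alignment work would turn the silent loss of a record into an explicit error.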

ivan-aksamentov commented 1 year ago

@mmolari Duplicate names have been a small disaster in the field.

We found in other projects that we could identify sequences uniquely by their index in the fasta file, instead of by name. This way you don't need to use names for lookups, and so duplicate names are just passed through like any other attribute, without causing any trouble.
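
As a rough sketch of the index-based idea (not code from pangraph or from the other projects mentioned), the Python below keys each record by its position in the file and carries the name along as an ordinary attribute. The `Record` class and `read_fasta_indexed` function are hypothetical names, and the name is taken to be the whole header line after `>`.

```python
from dataclasses import dataclass


@dataclass
class Record:
    index: int     # position of the record in the file; unique by construction
    name: str      # header text after '>'; may repeat across records
    sequence: str


def read_fasta_indexed(lines):
    """Parse FASTA lines into records identified by their 0-based index."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append(Record(len(records), name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append(Record(len(records), name, "".join(chunks)))
    return records


# Two records share a name, but both survive because lookups go through the index.
records = read_fasta_indexed([">seq1", "ACGT", ">seq1", "TTAA"])
by_index = {r.index: r for r in records}
print(by_index[1].sequence)  # prints TTAA
```

With this layout a duplicated name never overwrites an earlier entry, since nothing is looked up by name.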

Not sure if it is applicable here, but I thought I'd share my experience.

mmolari commented 1 year ago

This makes sense for most things. The only problem that would remain with this approach is that in the output file we have to connect occurrences of blocks (i.e. lines of an alignment) to their corresponding input fasta sequence, and the way we do it is by using the names in the output. If these names are the same, then the output would be ambiguous as well. On second thought, though, maybe this is not too bad: the input file was ambiguous from the start, so the problem is not really in pangraph's processing.