harmslab / topiary

Python framework for doing ancestral sequence reconstruction
MIT License
33 stars 7 forks source link

All aditional sequences removed in seed-to-alignment step #36

Open lbleicher opened 1 year ago

lbleicher commented 1 year ago

Hi, I've tried to run seed-to-alignment in another computer and this time it ran to the finish, but all sequences were removed (they get flagged Keep=false) and the final file (06_alignment.fasta) only has what was originally in my list of sequences. Am I doing something wrong with my input file? It has five sequences, two from Mus Musculus (which has the two paralogs), one from Homo sapiens (which lost one of them), one from Danio Rerio and one from Saccoglossus kowalevskii (as the outgroup). I'm sending the input file.

mq_lb_test.csv

lbleicher commented 1 year ago

So, I've been trying different configurations for the seed file and perhaps I am not building it properly.

From the example (LY86/LY96) it seems I shouldn't need to add a sequence as an outgroup, so in one of our attempts we made a similar file - if I want to reconstruct ancestrals for a pair of paralogs derived from a duplication that happened during early chordate evolution, I should get a basal chordate species that already has the two paralogs and a more derived one which also has both, and that will define the scope, is that correct?

In my case, one of the paralogs is missing in humans. Does that mean I should not include the other human sequence or is it ok as long as if I identify it?

Also, looking at the manual I noticed that even though aliases would be optional they should help the program identify which are the paralogs. So I built a new file with only species containing the two paralogs, and to each of them I added all aliases for each of them on Uniprot (also including gene names). This time I do get a final alignment that has more stuff than what was already on the seed, but it seems they only managed to retrieve sequences from one of the two paralogs, but not the other.

Here's what I have in my seed file:

species,name,aliases,sequence,accession Amblyraja radiata,TTR,TTR;ATT;Transthyretin;ATTR;Prealbumin;TBPA;PALB,MFRALLLVALALSASCTSVPHHHGDHDTRCPILIKVLDALKGTPAANVQVEMLKRNEDKTWQSINTGVTGRNGELHNLTSEEHLAVGLYKLHFETGAYWSNAGQAHFHECANVAFRVASLTSDHYTIAVLLSPYSYSTTGIVMNPHET,XP_032875375.1 Amblyraja radiata,HIU,HIUase;UraH;5-hydroxyisourate hydrolase;HIU hydrolase;HIUHase;Transthyretin-related protein,MTAPRLQHIQDHLVPRKCNSWAVSMSLSPLTTHALNTAEGIPAARLPVSVHRLDNEEWKEMAKGVTNSDGRCPGLLTSQTFIPGIYKIKFATDKYWQSLERASFYPYIEVVFNISDASQKYHIALLLNPFSYTTYRGS,XP_032891577.1 Rhincodon typus,TTR,TTR;ATT;Transthyretin;ATTR;Prealbumin;TBPA;PALB,MRFTLILIVCLAAFYKGETVDHGGSDTKCPINIKVLNAVNGTPAANLRLQIYRQNADLSWQQLNTGITAINGEAHNLTTEADFIPGMYKVHFNTADYWRNLGYTPFHECVNVLWFSDPFPWVLTSSQEAPSDWFTNCEVSEFLVVSNS,XP_020365676.2 Rhincodon typus,HIU,HIUase;UraH;5-hydroxyisourate hydrolase;HIU hydrolase;HIUHase;Transthyretin-related protein,MNSLEIARTAEAGVRVDTVWSWRKTTGYVGLDQGMDIDRAVITKKANTLDLNDNLKAATKLKTGNSLHRFSAVTYSNSPDHKMSQSPLTTHILNTAQGIPAAHVAVSVVRLEDKAWREIATGITNSDGRCPGLLTSETFVPGIYKMKFAVDEYWQSLKMASFYPYIEVVFNISDACQKYHIALLLSPFSYSTYRGS,XP_048461878.1 Protopterus annectens,TTR,TTR;ATT;Transthyretin;ATTR;Prealbumin;TBPA;PALB,MAKALYLLFLAGVVLLSDGAPVDHGTAEKSCPLEVKILDAIRGIPAAGMEVHVYKKAADDSWSEVATGKTTRDGEIHNLLNDENFLEGEYKVTFGTKSFWEGLNLNPFHLDHDVTFKAHTGGHQHYTIAVLLSPYSITTTAVVSSVQT,XP_043921383.1 Protopterus annectens,HIU,HIUase;UraH;5-hydroxyisourate hydrolase;HIU hydrolase;HIUHase;Transthyretin-related protein,MSSLRLSHIHHHIASDKNSQRMADGTSSRLTTHVLNTALGIPASNLALTVSRLEVVGNKEHWEQLTGRCTNSDGRCPGLLTNDHFIDATYKVRFETGDYWKEQGITSFYPYVEVVFTIKDPSQKYHIPLLLSPFSYTTYRGS,XP_043937680.1 Mus musculus,HIU,HIUase;UraH;5-hydroxyisourate hydrolase;HIU hydrolase;HIUHase;Transthyretin-related protein,MATESSPLTTHVLDTASGLPAQGLCLRLSRLEAPCQQWMELRTSYTNLDGRCPGLLTPSQIKPGTYKLFFDTERYWKERGQESFYPYVEVVFTITKETQKFHVPLLLSPWSYTTYRGS,Q9CRB3 Mus musculus,TTR,TTR;ATT;Transthyretin;ATTR;Prealbumin;TBPA;PALB,GPAGAGESKCPLMVKVLDAVRGSPAVDVAVKVFKKTSEGSWEPFASGKTAESGELHGLTTDEKFVEGVYRVELDTKSYWKTLGISPFHEFADVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN,P07309

lbleicher commented 1 year ago

Update: it seems the issue was caused by the fact one of the aliases for HIUase is part of one of the alias for TTRs (HIUases were also known as Transthyretin-related proteins), and that would cause it to drop it. Removing that alias seems to have solved the problem, I now got an alignment with sequences for both paralogs.

Kona-O commented 1 year ago

Hi! I just wanted to let you know that Mike is out of town until tomorrow. I'm glad you were able to figure out where things were going wrong in your seed file. I hope the rest of the steps go more smoothly. This is really helpful feedback for us to describe what to do and what not to do when constructing your seed dataset, as well as what kinds of checks topiary could potentially do to help the user from having too similar of aliases. Thank you!

lbleicher commented 1 year ago

Thanks, Kona, after fixing that detail it seems to be working fine with various input changes.

As for defining scope, am I right in not including an outgroup in the initial dataset?

Kona-O commented 1 year ago

That depends on what ancestors you are interested in. If you want to reconstruct the deepest ancestor of the family encompassing both HIUases and TTRs, then I think it would be best to include an outgroup species (with sequence and aliases) with the "key_species" column set to True. Otherwise, topiary might have a hard time finding and deciding to keep this kind of outgroup sequence. If you are reconstructing the deepest ancestor of either HIUases or TTRs then your current seed setup would be fine.

Also, to make things move faster, you can set the "key_species" column to True for only the most evolutionary divergent species for each paralog (example: both paralogs from Amblyraja radiata, both paralogs from Mus musculus, and the outgroup species sequence). It is ideal to choose species with well-annotated genomes, but that's not always an option. Then set all other special sequences you want to include in your project as "key_species" = False. The sequences with "key_species" = True will be used in the initial BLAST step, which by default searches through sequences 2 taxonomic ranks deeper than the rank set by your key_species. This is in hopes of finding deeper ancestors if they are out there. Then topiary will download and BLAST all of the resulting initial sequence hits against the full proteome available for each of the key_species. The ones set to false will not be used for these steps to minimize workload but will be maintained throughout your project.

harmsm commented 1 year ago

@lbleicher : Glad you got it to work! Did topiary give you any warning about the alias overlap? In many cases, it will do so; I'd like it to do so all of the time if possible.

lbleicher commented 1 year ago

It did! I probably missed it before because it came early in the output - perhaps it would be interesting to write the warnings as the last output text? Perhaps I had the same issue in other attempts but didn't notice it.

harmsm commented 1 year ago

Good to know it at least warned you. Unfortunately, python doesn’t give much control over output of warnings and errors. I’ll have to think about how to make that more user friendly...

On Mar 24, 2023, at 12:22 PM, lbleicher @.***> wrote:

It did! I probably missed it before because it came early in the output - perhaps it would be interesting to write the warnings as the last output text? Perhaps I had the same issue in other attempts but didn't notice it.

— Reply to this email directly, view it on GitHub https://github.com/harmslab/topiary/issues/36#issuecomment-1483301011, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABFZA6V2JREIU24ZEB2Z3STW5XX6DANCNFSM6AAAAAAWBUMYGY. You are receiving this because you commented.