scharch / SONAR

Software for Ontogenic aNalysis of Antibody Repertoires
GNU General Public License v3.0

Assigned germline V gene not necessarily used as root in 3.2-run_IgPhyML.py #12

Closed: ressy closed this issue 3 years ago

ressy commented 3 years ago

The docstring for 3.2-run_IgPhyML.py says that -v is the "assigned germline V gene of known antibodes, for use in rooting the trees," but I'm running into some instances where it doesn't use this sequence ID for the root. I think this is because the script determines the sequence ID for the root of the tree with a regular expression, which can inadvertently pick up a different sequence depending on the full set of sequence IDs. The steps I see in 3.2-run_IgPhyML.py are:

  1. germ_seq is defined via -v argument
  2. germ_seq is written into the to-align file, along with the collected and native sequences
  3. germ_id is defined by regex-matching each sequence ID from the aligned file
  4. germ_id is passed to igphyml as --root

In my case I have a "_LightSeq" suffix on each sequence in my natives.fa and re.search("(IG|VH|VK|VL|HV|KV|LV)", seq.id, re.I) matches the "ig" in each of those, overwriting the correct "IGLV..." sequence ID matched earlier in the file.
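To illustrate, here is a minimal sketch of the false positive. The sequence IDs are made up, and the loop is only my reading of the script's behavior (each later match replaces the earlier one), not a copy of the actual code:

```python
import re

# Same pattern and flags as quoted above from 3.2-run_IgPhyML.py.
GERMLINE_PATTERN = re.compile("(IG|VH|VK|VL|HV|KV|LV)", re.I)

# Hypothetical sequence IDs: one germline entry followed by natives
# that carry a "_LightSeq" suffix.
seq_ids = ["IGLV3-21*01", "native1_LightSeq", "native2_LightSeq"]

# Assumed behavior: every ID that matches overwrites germ_id,
# so the last matching ID in the file wins.
germ_id = None
for seq_id in seq_ids:
    if GERMLINE_PATTERN.search(seq_id):
        germ_id = seq_id

# Record which substring triggered the match for each ID.
matched_text = {s: GERMLINE_PATTERN.search(s).group(0) for s in seq_ids}

print(germ_id)       # the root that would be handed to igphyml
print(matched_text)  # shows the "ig" inside "Light" being picked up
```

With re.I, the "ig" in "Light" satisfies the IG alternative, so the germline ID found first gets replaced by a native sequence.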

I can't override this by adding --root, either, since it's mutually exclusive with -v. Would there be any downside to automatically setting arguments['--root'] = arguments['-v'] when arguments['-v'] is not None? That value would then get used as germ_id and passed to igphyml as the correct root.
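Roughly what I have in mind, as a sketch only. It assumes arguments is a plain docopt-style dict; the helper name apply_root_default is hypothetical and not something in SONAR:

```python
def apply_root_default(arguments):
    """Return a copy of the arguments dict with --root defaulted from -v.

    Hypothetical helper: it only fills --root when -v was given and
    --root was not, so an explicit --root is never replaced.
    """
    updated = dict(arguments)  # copy, so the caller's dict is left alone
    if updated.get("-v") is not None and updated.get("--root") is None:
        updated["--root"] = updated["-v"]
    return updated


# Example with a made-up germline name passed as -v.
args = apply_root_default({"-v": "IGLV3-21*01", "--root": None})
print(args["--root"])
```

Since -v and --root are mutually exclusive on the command line, the second condition should always hold when -v is set; it is there only as a safeguard.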

scharch commented 3 years ago

I try to avoid changing command line parameters inside the program, but I did set it up to check for -v and not stupidly overwrite it. You should be good to go.
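For anyone reading later, the idea is along these lines. This is an illustrative sketch, not the code that was committed; the function name find_germ_id and its parameters are hypothetical:

```python
import re

GERMLINE_PATTERN = re.compile("(IG|VH|VK|VL|HV|KV|LV)", re.I)


def find_germ_id(seq_ids, v_gene=None):
    """Pick the root ID without touching the parsed command line options.

    If a V gene was supplied (the -v case), use it directly and skip the
    regex scan. Otherwise fall back to the scan, where the last matching
    ID wins (assumed to mirror the original behavior).
    """
    if v_gene is not None:
        return v_gene
    germ_id = None
    for seq_id in seq_ids:
        if GERMLINE_PATTERN.search(seq_id):
            germ_id = seq_id
    return germ_id


# Made-up IDs reproducing the "_LightSeq" situation from this issue.
ids = ["IGLV3-21*01", "native1_LightSeq"]
print(find_germ_id(ids, v_gene="IGLV3-21*01"))  # -v given: kept as root
print(find_germ_id(ids))                        # no -v: regex fallback
```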