jeetsukumaran / DendroPy

A Python library for phylogenetic scripting, simulation, data processing and manipulation.
https://pypi.org/project/DendroPy/
BSD 3-Clause "New" or "Revised" License

PHYLIP reader can only read one character matrix per file #29

Closed pranjalv123 closed 9 years ago

pranjalv123 commented 9 years ago

It would be useful if the PHYLIP reader could read a bunch of character matrices from a single PHYLIP file. For example, a PHYLIP file might look like:

200 345
...200 lines with 345 characters each...
200 221
...200 lines with 221 characters each...
etc.

jeetsukumaran commented 9 years ago

What is the use case for this? What systems generate files like this? Is this documented anywhere?

pranjalv123 commented 9 years ago

I have some simulated datasets I'm trying to analyze that are formatted like this, from http://www.cs.utexas.edu/~phylo/datasets/astral2/

I'm working on a patch that resolves this; I think I should have it ready relatively soon.

jeetsukumaran commented 9 years ago

At this point, I would rather not add support for this. The phylogenetic dataspace is already polluted with too many idiosyncratic, poorly/inconsistently/incorrectly/non-documented data formats as well as many standards-violating variants of existing data formats for us to introduce yet another one, which we will have to maintain in perpetuity.

Ideally, the upstream programs should be fixed to generate standards-compliant files, as your patch does. If they cannot, then a pre-processing step where the file is split based on the appropriate regular expression would be the solution.
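
For illustration, the pre-processing step described above might look roughly like the sketch below. It is not part of DendroPy; the function name `split_phylip_blocks` is hypothetical. It assumes each matrix is in sequential (non-interleaved) form with exactly one line per taxon, and that every matrix starts with a header line holding two integers (number of taxa, number of characters). A regular expression identifies the header, and the taxon count from the header decides how many lines belong to that matrix.

```python
import re

# Assumed header shape: two integers (taxa count, character count) on one line.
HEADER_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s*$")


def split_phylip_blocks(text):
    """Split concatenated PHYLIP matrices into single-matrix strings.

    Hypothetical helper: assumes sequential PHYLIP with one line per taxon.
    """
    # Blank lines carry no data here, so drop them before counting lines.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    blocks = []
    i = 0
    while i < len(lines):
        match = HEADER_RE.match(lines[i])
        if match is None:
            raise ValueError("expected a PHYLIP header, got: %r" % lines[i])
        n_taxa = int(match.group(1))
        # One header line plus n_taxa sequence lines make up one matrix.
        block = lines[i:i + n_taxa + 1]
        if len(block) != n_taxa + 1:
            raise ValueError("truncated matrix at header: %r" % lines[i])
        blocks.append("\n".join(block) + "\n")
        i += n_taxa + 1
    return blocks
```

Each returned string is then a standalone single-matrix PHYLIP document. Assuming DNA data, it could be handed to the existing reader one at a time, for example with `dendropy.DnaCharacterMatrix.get(data=block, schema="phylip")`. Interleaved files or sequences wrapped over several lines would need a different splitting rule.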