salilab / pmi

Python Modeling Interface
https://integrativemodeling.org/nightly/doc/ref/namespaceIMP_1_1pmi.html
11 stars 11 forks source link

IMP.pmi.topology.TopologyReader does not process fasta_dir argument corretly #247

Closed tpeulen closed 5 years ago

tpeulen commented 5 years ago

When reading topology file by the reading routine IMP.pmi.topology.TopologyReader unexpected results are obtained when the fasta_dir argument is used.

For instance, for the topology file

|molecule_name|domain_namecolor|fasta_fn|fasta_id|pdb_fn|chain|residue_range|pdb_offset|bead_size|em_residues_per_gaussian|rigid_body|super_rigid_body|chain_of_super_rigid_bodies|flags|
|hGBP1|blue|1f5n.fasta|1f5n_A|1f5n.pdb|A|1,308|0|10|0|1|1,7|1||
|hGBP1|blue|1f5n.fasta|1f5n_A|1f5n.pdb|A|309,348|0|5|0|2|1,7|1||
|hGBP1|blue|1f5n.fasta|1f5n_A|1f5n.pdb|A|349,372|0|5|0|3|1,7|1||
|hGBP1|blue|1f5n.fasta|1f5n_A|1f5n.pdb|A|373,431|0|5|0|4|1,7|1||
|hGBP1|blue|1f5n.fasta|1f5n_A|1f5n.pdb|A|432,481|0|5|0|5|1,7|1||
|hGBP1|blue|1f5n.fasta|1f5n_A|1f5n.pdb|A|482,565|0|5|0|6|1,7|1||
|hGBP1|blue|1f5n.fasta|1f5n_A|1f5n.pdb|A|566,END|0|5|0|7|1,7|1||

reader = IMP.pmi.topology.TopologyReader(
    topology_file,
    pdb_dir=data_directory + "/pdb/",
    fasta_dir=data_directory + "/fasta/"
)
reader.get_components()

returns

[|hGBP1|blue|1f5n.fasta|1f5n_A|1f5n.pdb|A|1,308|0|10|0|1|1,7|1|,
 |hGBP1|blue|./00_data//fasta/1f5n.fasta|1f5n_A|1f5n.pdb|A|309,348|0|5|0|2|1,7|1|,
 |hGBP1|blue|./00_data//fasta/1f5n.fasta|1f5n_A|1f5n.pdb|A|349,372|0|5|0|3|1,7|1|,
 |hGBP1|blue|./00_data//fasta/1f5n.fasta|1f5n_A|1f5n.pdb|A|373,431|0|5|0|4|1,7|1|,
 |hGBP1|blue|./00_data//fasta/1f5n.fasta|1f5n_A|1f5n.pdb|A|432,481|0|5|0|5|1,7|1|,
 |hGBP1|blue|./00_data//fasta/1f5n.fasta|1f5n_A|1f5n.pdb|A|482,565|0|5|0|6|1,7|1|,
 |hGBP1|blue|./00_data//fasta/1f5n.fasta|1f5n_A|1f5n.pdb|A|566,END|0|5|0|7|1,7|1|]

To me this is unexpected.

Hence, I suggest to change in the TopologyReader class

        if c.molname not in self.molecules:
            self.molecules[c.molname] = _TempMolecule(c)
        else:
            # COPY OR DOMAIN
            c._orig_fasta_file = self.molecules[c.molname].orig_component.fasta_file
            c.fasta_id = self.molecules[c.molname].orig_component.fasta_id
            self.molecules[c.molname].add_component(c,c.copyname)

to

        if c.molname not in self.molecules:
            self.molecules[c.molname] = _TempMolecule(c)
        else:
            # COPY OR DOMAIN
            c._orig_fasta_file = self.molecules[c.molname].orig_component._orig_fasta_file
            c.fasta_id = self.molecules[c.molname].orig_component.fasta_id
            self.molecules[c.molname].add_component(c,c.copyname)

To produce components as likely originally intended.

benmwebb commented 5 years ago

Sure, open a pull request with this change and a test case to assert on the new behavior. Then we'll see if it breaks anything.