added AA parsing to fasta file reading

marrink-lab / polyply_1.0

Generate input parameters and coordinates for atomistic and coarse-grained simulations of polymers, ssDNA, and carbohydrates

Apache License 2.0

125 stars 22 forks source link

added AA parsing to fasta file reading #303

Closed csbrasnett closed 1 year ago

csbrasnett commented 1 year ago

Previously the fasta reader couldn't actually parse AA sequences, this should fix that. I haven't tested how this should affect the reading of .fasta files for DNA/RNA sequences, but I don't think it should.

csbrasnett commented 1 year ago

This is now failing because the tests expect _identify_nucleotypes to return a tuple of length 2, but it now returns something of length 3. I realise also that I've written the AA identifier into the _identify_nucleotypes function, which isn't actually true. So we could either rename that function in order to deal with this, or have a separate method for identifying AA sequences in fasta files.

(on this note, is it right that we expect 'RNA' or 'DNA' to be in the fasta string name? I guess it's fine to enforce it here, but I think in general the string can be anything?)

fgrunewald commented 1 year ago

@csbrasnett thanks for the PR already! You can also convert this to a proper PR already. Here are my thoughts:

the string in fasta can be anything. That is true. However, there is no way (to my knowledge) to distinguish if RNA, DNA or a protein is described without knowing the context. Thus we opted for enforcing RNA, DNA in the string descriptor instead of having an additional CLI flag that needs to be propagated.
we should probably rename _identify_nucleotypes to _identify_residues when adding AA
once renamed you can fix the test such that it can take 3 outputs; let me know if you need help with making the tests
only question remains should we simply assume if something is not RNA or DNA that it is a protein? @pckroon any thoughts no this from the wise old people ?

csbrasnett commented 1 year ago

Thanks for checking @fgrunewald, I had a go at dealing with the tests but doesn't look like they worked

fgrunewald commented 1 year ago

@csbrasnett could you add a test like the one in line 120 but for protein fasta files? The code should look something like the one shown below. You will need to put an example file with a sequence in the test data directory and then adjust the sequence list in the function below by the sequence in the data file

@pytest.mark.parametrize('extension, ', (
      "ig",
      "fasta"
     ))
def test_sequence_parses_PROTEIN(extension):
    filepath = Path(TEST_DATA + "/simple_seq_files/test_protein."+ extension)
    seq_graph = MetaMolecule.parsers[extension](filepath)
    # replace below with your small protein sequence; ignore termini
    monomers = ["A5", "U", "C", "G", "U", "A", "C", "A", "U3"]
    ref_graph = _monomers_to_linear_nx_graph(monomers)
    assert nx.is_isomorphic(seq_graph, ref_graph, node_match=_node_match)

csbrasnett commented 1 year ago

As far as I understand why it's failing now, the test isn't finding the extension, but I'm not sure why?

fgrunewald commented 1 year ago

sorry my bad. You need to add the following lines on top:

@pytest.mark.parametrize('extension, ', (
      "ig",
      "fasta"
     ))

The parametric essentially tells pytest which parameters to evaluate for the input arguments of the function.

csbrasnett commented 1 year ago

ah grand thanks, I thought it was captured by the previous parametric, good to know. In this case I'll also remove the test for the .ig file because that's not a format that would be used for proteins.

fgrunewald commented 1 year ago

@csbrasnett since you removed the ig, you can also drop the header and simply put in the full path to the fasta file. It is failing at the moment because it is splitting fasta into f a s t a

fgrunewald commented 1 year ago

@csbrasnett also remove the @pytest.mark.parametrize() It is only needed when you explore a matrix of different input values

fgrunewald commented 1 year ago

@csbrasnett I think the reference should be the full three letter residue names not the one letter code. It is the sequence you get after parsing and processing the file. By chance for RNA they are one letter residue names

csbrasnett commented 1 year ago

Thanks for all your help! First of many I hope :smile: