Closed csbrasnett closed 1 year ago
This is now failing because the tests expect _identify_nucleotypes to return a tuple of length 2, but it now returns something of length 3. I realise also that I've written the AA identifier into the _identify_nucleotypes function, which isn't actually true. So we could either rename that function in order to deal with this, or have a separate method for identifying AA sequences in fasta files.
(on this note, is it right that we expect 'RNA' or 'DNA' to be in the fasta string name? I guess it's fine to enforce it here, but I think in general the string can be anything?)
@csbrasnett thanks for the PR already! You can also convert this to a proper PR already. Here are my thoughts:
_identify_nucleotypes
to _identify_residues
when adding AAThanks for checking @fgrunewald, I had a go at dealing with the tests but doesn't look like they worked
@csbrasnett could you add a test like the one in line 120 but for protein fasta files? The code should look something like the one shown below. You will need to put an example file with a sequence in the test data directory and then adjust the sequence list in the function below by the sequence in the data file
@pytest.mark.parametrize('extension, ', (
"ig",
"fasta"
))
def test_sequence_parses_PROTEIN(extension):
filepath = Path(TEST_DATA + "/simple_seq_files/test_protein."+ extension)
seq_graph = MetaMolecule.parsers[extension](filepath)
# replace below with your small protein sequence; ignore termini
monomers = ["A5", "U", "C", "G", "U", "A", "C", "A", "U3"]
ref_graph = _monomers_to_linear_nx_graph(monomers)
assert nx.is_isomorphic(seq_graph, ref_graph, node_match=_node_match)
As far as I understand why it's failing now, the test isn't finding the extension, but I'm not sure why?
sorry my bad. You need to add the following lines on top:
@pytest.mark.parametrize('extension, ', (
"ig",
"fasta"
))
The parametric essentially tells pytest which parameters to evaluate for the input arguments of the function.
ah grand thanks, I thought it was captured by the previous parametric, good to know. In this case I'll also remove the test for the .ig file because that's not a format that would be used for proteins.
@csbrasnett since you removed the ig, you can also drop the header and simply put in the full path to the fasta file. It is failing at the moment because it is splitting fasta into f a s t a
@csbrasnett also remove the @pytest.mark.parametrize()
It is only needed when you explore a matrix of different input values
@csbrasnett I think the reference should be the full three letter residue names not the one letter code. It is the sequence you get after parsing and processing the file. By chance for RNA they are one letter residue names
Thanks for all your help! First of many I hope :smile:
Previously the fasta reader couldn't actually parse AA sequences, this should fix that. I haven't tested how this should affect the reading of .fasta files for DNA/RNA sequences, but I don't think it should.