Closed by pdcharles 5 years ago
Hi Phil,
Thank you very much for your comments and suggestions! We updated the README file accordingly in the last commit; hopefully those changes make it clearer.
As for your second comment, yes, we're already thinking about changing our one-hot-encoding strategy as we keep adding new modifications.
Best, Peter
Hiya,
The example input given in the Prism Readme
is not actually parsed correctly with the command suggested in the same readme
Running this command gives an output file metadata.csv (which is actually a tsv, but one thing at a time...) which reads
Furthermore, checking the JSON, the one-hot array for the first peptide, residue position 3, is hot (has a 1 rather than a 0) at index 10. If I've read the code correctly, the residue alphabet is generated from the 20 standard AAs plus a special definition for 'M(ox)', and is then sorted, which puts bare 'M' at index 10 and 'M(ox)' at index 11 (so the residue is incorrect both in the metadata and in the JSON sent to the model).
It looks like this is because preprocess.py defaults to calling the function clean_peptides (from utils) on the sequence string:
This function uses a regex that's only looking for '[modification]' format, so it will strip any modifications in the '(modification)' format.
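To illustrate the effect, here is a minimal sketch of how a bracket-only pattern could silently drop a parenthesised modification. The pattern, the function name `clean_peptide_sketch` and the peptide strings are all made up for illustration; this is not the actual regex from utils, just one pattern that would behave the way described above.

```python
import re

# Hypothetical pattern (NOT the regex from utils): a residue letter,
# optionally followed by a '[modification]' group.
_TOKEN = re.compile(r"[A-Z](?:\[[^\]]*\])?")


def clean_peptide_sketch(sequence):
    """Keep only the tokens the pattern recognises and drop everything else."""
    return "".join(_TOKEN.findall(sequence))


# Made-up peptides, one per modification style.
bracket_style = clean_peptide_sketch("AAM[ox]K")  # modification is kept
paren_style = clean_peptide_sketch("AAM(ox)K")    # '(ox)' is dropped, leaving bare M

print(bracket_style)
print(paren_style)
```

With a pattern like this, the MaxQuant-style input comes out as bare 'M' at position 3, which would be consistent with the hot index 10 seen in the JSON.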
1) Should the clean_peptides flag be set true by default? It seems a bit counterintuitive since MaxQuant-style modification definitions are then an invalid input without explicitly setting
--clean_peptides="False"
as an argument. If so, I think prism/readme.md might benefit from an update to clarify :-)
2) Is it intended that 'M(ox)', and presumably any future mods, is sorted along with the bare AAs when creating the alphabet? It's not clear whether it is an oversight (from a last-minute addition of the M(ox) capability) or intended that, once the alphabet is sorted, the order of residue definitions diverges from the one given in _MOL_WEIGHTS. This means that every time a new modification is added, not only does the input array change length, but the meaning of each one-hot index changes as well. For example, defining, say, 'K(tmt10)' will mean (after sorting the alphabet) that 'M' is now signified by hot position 11 and 'M(ox)' by hot position 12. Seems like a recipe for bugs!
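The index shift can be reproduced with a short sketch. It assumes the alphabet is simply the 20 standard one-letter codes plus the modified residues; the names `BASE_RESIDUES`, `sorted_alphabet` and `appended_alphabet` are hypothetical and not taken from the Prism code. The append-only variant is just one possible way to keep indices stable, not a description of what the maintainers do or plan to do.

```python
# The 20 standard amino acids, already in alphabetical order.
BASE_RESIDUES = list("ACDEFGHIKLMNPQRSTVWY")


def sorted_alphabet(mods):
    """Sort modified residues together with the bare ones (behaviour described above)."""
    return sorted(BASE_RESIDUES + list(mods))


def appended_alphabet(mods):
    """Hypothetical alternative: append modifications after the bare residues."""
    return BASE_RESIDUES + list(mods)


current = sorted_alphabet(["M(ox)"])
extended = sorted_alphabet(["M(ox)", "K(tmt10)"])
print(current.index("M"), current.index("M(ox)"))    # 10 11
print(extended.index("M"), extended.index("M(ox)"))  # 11 12

stable = appended_alphabet(["M(ox)"])
stable_extended = appended_alphabet(["M(ox)", "K(tmt10)"])
print(stable.index("M"), stable.index("M(ox)"))                    # 10 20
print(stable_extended.index("M"), stable_extended.index("M(ox)"))  # 10 20
```

With the append-only ordering, existing indices keep their meaning when a modification is added at the end of the list, although the array length still grows.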
Best,
Phil