verilylifesciences / deepmass

DeepMass is a suite of tools to enable mass spectrometry data analysis using modern machine learning techniques.
Apache License 2.0
46 stars 13 forks source link

Prism preprocessing defaults to stripping (ox) from M #1

Closed pdcharles closed 5 years ago

pdcharles commented 5 years ago

Hiya,

The example input given in the Prism Readme

ModifiedSequence,Charge,Fragmentation,MassAnalyzer
AKM(ox)LIVR,3,HCD,ITMS
ILFWYK,2,CID,FTMS

is not actually parsed correctly with the command suggested in the same readme

python preprocess.py \
  --input_data="${DATA_DIR}/input_table.csv" \
  --output_data_dir="${DATA_DIR}" \
  --sequence_col="ModifiedSequence" \
  --charge_col="Charge" \
  --fragmentation_col="Fragmentation" \
  --analyzer_col="MassAnalyzer"

Running this command gives an output file metadata.csv (which is actually a tsv, but one thing at a time...) which reads

    index   ModifiedSequence    Charge  Fragmentation   MassAnalyzer    Length  Charge_pos  Length_pos  Fragmentation_HCD   Fragmentation_CID   MassAnalyzer_ITMS   MassAnalyzer_FTMS
0   0   AKMLIVR 3   HCD ITMS    7   0.42857142857142855 0.23333333333333334 1   0   1   0
1   1   ILFWYK  2   CID FTMS    6   0.2857142857142857  0.2 0   1   0   1

Furthermore, checking the JSON, the one-hot array for first peptide, residue position 3 is hot (has a 1 rather than a zero) at index 10. If I've read the code correctly, the residue alphabet is generated from the 20 standard AAs + a special definition for 'M(ox)', and is then sorted, which puts bare 'M' at index 10 and 'M(ox)' at index 11 (so the residue is incorrect both in the metadata and in the JSON sent to the model).

  mol_weights = pd.Series(_MOL_WEIGHTS)
  alphabet = [k for k in mol_weights.keys() if not k.startswith(_GROUP)]
  alphabet = sorted(alphabet)
  one_hot_encoding = pd.get_dummies(alphabet).astype(int).to_dict(orient='list')

It looks like this is because preprocess.py defaults to calling the function clean_peptides (from utils) on the sequence string:

flags.DEFINE_bool(
    'clean_peptides',
    True,
    'True if peptide modifications are in [x] format.')
_RE_AA = re.compile(r'[A-Z](\[(\w+)\])?')
...
def clean_peptides(peptide_list):
...
  clean_peptide_list = []
  for peptide in peptide_list:
    clean_peptide = []
    for residue in re.finditer(_RE_AA, peptide):

This function uses a regex that's only looking for '[modification]' format, so it will strip any modifications in the '(modification)' format.

1) Should the clean_peptides flag be set true by default? It seems a bit counterintuitive since MaxQuant-style modification definitions are then an invalid input without explicitly setting --clean_peptides="False" as an argument. If so, I think prism/readme.md might benefit from an update to clarify :-)

2) Is it intended that 'M(ox)', presumably any other future mods, is sorted along with the bare AAs when creating the alphabet? It's not clear if it is an oversight (from last-minute addition of M(ox) capability) or intended that when the alphabet is sorted, the order of residue definitions diverges from the one given in _MOL_WEIGHTS. This will mean every time a new modification is added, as well as the input array changing length, the significance of the one-hot index changes as well - e.g. to define say 'K(tmt10)' will mean (after sorting the alphabet) that 'M' is now signified by hot position 11 and 'M(ox)' by hot position 12. Seems like a recipe for bugs!

Best,

Phil

petercim commented 5 years ago

Hi Phil,

Thank you very much for your comments and suggestions! We updated the README file accordingly in the last commit, hopefully those changes make it clearer.

As for your second comment, yes, we're already thinking about changing our one-hot-encoding strategy as we keep adding new modifications.

Best, Peter