statisticalbiotechnology / triqler

Source and example code for triqler (TRansparent Identification-Quantification-linked Error Rates)
Apache License 2.0

Empty extra columns cause peptides to be considered shared #4

Closed: MatthewThe closed this issue 5 years ago

MatthewThe commented 5 years ago

The proteins of a peptide are listed as tab-separated columns at the end of each row in the triqler input file, so empty trailing columns are parsed as extra proteins. The peptide is then considered shared and is discarded. When no proteotypic peptides remain, this results in the following error:

Parsing triqler input file
Calculating identification PEPs
featureClusterIdx: 0
featureClusterIdx: 10000
Dividing intensities by 100000 for increased readability
Surviving spectrumIdxs: 12452
Converting to peptide quant rows
Calculating peptide-level identification PEPs
Writing peptide quant rows to file
Fitting hyperparameters
Traceback (most recent call last):
  File "/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/anaconda2/lib/python2.7/site-packages/triqler/__main__.py", line 8, in <module>
    main()
  File "/anaconda2/lib/python2.7/site-packages/triqler/triqler.py", line 36, in main
    runTriqler(params, args.in_file, args.out_file)
  File "/anaconda2/lib/python2.7/site-packages/triqler/triqler.py", line 104, in runTriqler
    diff_exp.doDiffExp(params, peptQuantRows, triqlerOutputFile, getPickedProteinCalibration, selectComparisonBayesTmp, qvalMethod = qvalMethod)
  File "/anaconda2/lib/python2.7/site-packages/triqler/diff_exp.py", line 17, in doDiffExp
    proteinOutputRows = proteinQuantificationMethod(peptQuantRows, params, proteinModifier, getEvalFeatures)
  File "/anaconda2/lib/python2.7/site-packages/triqler/triqler.py", line 339, in getPickedProteinCalibration
    hyperparameters.fitPriors(peptQuantRows, params) # updates priors
  File "/anaconda2/lib/python2.7/site-packages/triqler/hyperparameters.py", line 57, in fitPriors
    fitLogitNormal(observedXICValues, params, plot)
  File "/anaconda2/lib/python2.7/site-packages/triqler/hyperparameters.py", line 84, in fitLogitNormal
    vals, bins = np.histogram(observedValues, bins = np.arange(minBin, maxBin, 0.1), normed = True)
ValueError: arange: cannot compute length

This is a problem if the file is saved as .tsv by, for example, Excel, which pads each row with extra empty columns to match the longest row.
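
To illustrate the padding effect, here is a minimal sketch of a naive tab split. It is not triqler's actual parser: the function name `naive_proteins`, the example rows, and the assumption of six fixed columns before the protein columns are all hypothetical.

```python
def naive_proteins(line, num_fixed_columns=6):
    """Return every column after the fixed ones, including empty padding.

    num_fixed_columns=6 is an assumption made for this illustration.
    """
    fields = line.rstrip("\r\n").split("\t")
    return fields[num_fixed_columns:]


# Hypothetical rows: the second one mimics a row padded with two empty cells.
clean_row = "runA\t1\t2\t1.5\t1000.0\tPEPTIDEK\tprotA"
padded_row = clean_row + "\t\t"

print(naive_proteins(clean_row))   # ['protA']
print(naive_proteins(padded_row))  # ['protA', '', '']
```

With this kind of split, the padded row appears to map to three proteins, so a check such as `len(proteins) > 1` would flag the peptide as shared.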

To solve this, we should add only non-empty proteins to the peptide during parsing. We should also display a clearer error message when no proteotypic peptides are found, with a hint that this could be due to shared peptides.
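
A rough sketch of both proposed changes could look like the following. It is not the actual patch: the helper names `parse_proteins` and `select_proteotypic`, the six-fixed-column layout, and the wording of the error message are all assumptions made for illustration.

```python
def parse_proteins(fields, num_fixed_columns=6):
    """Keep only non-empty protein columns (hypothetical helper).

    num_fixed_columns=6 is an assumption made for this illustration.
    """
    return [p.strip() for p in fields[num_fixed_columns:] if p.strip()]


def select_proteotypic(peptide_rows):
    """Keep peptides mapping to exactly one protein; fail early otherwise.

    peptide_rows is a list of (peptide, proteins) tuples (assumed layout).
    """
    proteotypic = [(pep, prots[0]) for pep, prots in peptide_rows
                   if len(prots) == 1]
    if not proteotypic:
        # Hypothetical message; the real wording is up to the maintainers.
        raise ValueError(
            "No proteotypic peptides found. This could be due to shared "
            "peptides, e.g. caused by empty trailing columns in the input "
            "file.")
    return proteotypic


# Hypothetical row padded with two empty trailing columns.
fields = "runA\t1\t2\t1.5\t1000.0\tPEPTIDEK\tprotA\t\t".split("\t")
proteins = parse_proteins(fields)
print(proteins)                                      # ['protA']
print(select_proteotypic([("PEPTIDEK", proteins)]))  # [('PEPTIDEK', 'protA')]
```

Raising the error before hyperparameter fitting would replace the opaque `arange` failure with a message that points at the likely cause.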