lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License
201 stars 38 forks source link

LFQ output is empty #74

Closed tomthun closed 11 months ago

tomthun commented 1 year ago

Hey Michael,

I tried running a bigger Dataset with LFQ and wondered why the lfq.tsv table is empty. The columns header gets generated but all rows are empty. Do I need to specify something else in the config.json than setting lfq = True?

Best, Thomas

lazear commented 1 year ago

Can you please share the results.json file and log information (try also running sage with tracing enabled command SAGE_LOG=trace ./sage)? Is it also possible to share an mzML?

tomthun commented 1 year ago

Hey, here is the results.json. I still have to think about a way to provide a mzml file.

During a discussion with a college we came to the conclusion that it may be a problem with the conversion of .d to mzml. As i have described in #72 i used alphatims to convert to mgf and msconvert to convert to mzml and hence there might be some information loss of ms1 spectra.

lazear commented 1 year ago

What is the output log? Are there PSMs and peptides confidently IDed?

tomthun commented 1 year ago

I am currently creating the log files. Do you have a private email so that i can send you the link to the mzml?

Regarding PSMs and peptides: they are confidently ided, i get 2401 decoys of 822772 peptides when filtering for PSM and peptide qvalue < 0.01 for both

Edit: I added "SAGE_LOG=trace": "./sage", to the json file but the logs aren't created. If i specifiy the full path, it will also not be created.

lazear commented 1 year ago

My email can be found in this file: https://github.com/lazear/sage/blob/master/crates/sage/Cargo.toml

The tracing part needs to go before the actual command line invocation of sage, not in the JSON file: SAGE_LOG=trace ./sage config.json -f human.fasta. Given that sage is ID'ing peptides at 1% FDR, it seems likely that this is an mzML/TIMS issue. I will try and investigate once I get the file!

tomthun commented 1 year ago

Does tracing work with windows? I wrote you an email with the link to the data! :)

lazear commented 1 year ago

Thanks Thomas, I got the data and will ping you when I've had a change to investigate.

For windows you'll need to change the syntax depending on whether you're using powershell or something else. Basically just need to set an environment variable "SAGE_LOG=trace"

lazear commented 1 year ago

(There was an issue with the mzML file that was communicated to Thomas via email, closing for no-response)

Feel free to re-open if there is an issue with other mzMLs

tomthun commented 1 year ago

Hey, sorry for the delayed answer but i'd like to reopen this issue again and generally ask how you would do conversion (i currently use msconvert) for different types of data (.raw or .d). When i try to convert timsTof data in any way Sage seems to not pick up the MS1 spectra / cannot do LFQ. Do you have any general recommendations for conversions?

lazear commented 1 year ago

I haven't personally processed any timsTOF data, but I think @jspaezp has, perhaps they are willing to give some tips.

I wouldn't say the issue is that Sage cannot do LFQ or pick up MS1 spectra... the issue with the file you shared is that there weren't any MS1 scans present at all (at least nothing with "ms level = 1"). So this seems like something that should be resolvable with a different conversion strategy

jspaezp commented 1 year ago

Helo there! at this time we have gotten the best results using this guy to convert the .d files to .mzml: https://github.com/mafreitas/tdf2mzml

But I think the underlying issue is that depending on the conversion tool, the way that collapsing all the 'scans' in a 'frame' (IMS elution) into a single scan can change dramatically. Tools that just 'collapse' the IMS dimension tend to give 'pseudo-centroid' data that might screw up with tools that match peaks by 'closest match' instead of integrating the matching area. On the other hand tdf2mzml seems to do some 'centroiding' that collapses the IMS features into real centroided peaks.

It is worth noting that I am aware that with some arguments msconvert does not collapse the IMS dimension, leading to every 'frame' being recorded as MANY scans with the ramping mobility values.

I know there was a mention of an intermediate conversion to .mgf ... and I can see that some tools/argument combinations might just remove MS1 data to support tools that do not expect ms1 scans in there.

header of an ms1 spectrum generated with tfd2mzml:

        <spectrum index="0" defaultArrayLength="24291" id="index=1">
          <cvParam cvRef="PSI-MS" accession="MS:1000511" name="ms level" value="1"/>
          <cvParam cvRef="PSI-MS" accession="MS:1000285" name="total ion current" value="19601770.0"/>
          <cvParam cvRef="PSI-MS" accession="MS:1000505" name="base peak intensity" value="4691.0" unitCvRef="PSI-MS" unitAccession="MS:1000131" unitName="number of detector counts"/>
          <cvParam cvRef="PSI-MS" accession="MS:1000504" name="base peak m/z" value="325.2882862039353" unitCvRef="PSI-MS" unitAccession="MS:1000040" unitName="m/z"/>
          <cvParam cvRef="PSI-MS" accession="MS:1000130" name="positive scan" value=""/>
          <cvParam cvRef="PSI-MS" accession="MS:1000127" name="centroid spectrum" value=""/>
          <cvParam cvRef="PSI-MS" accession="MS:1000579" name="MS1 spectrum" value=""/>
          <scanList count="1">
            <cvParam cvRef="PSI-MS" accession="MS:1000795" name="no combination" value=""/>
            <scan>
              <cvParam cvRef="PSI-MS" accession="MS:1000016" name="scan start time" value="0.0079112" unitCvRef="PSI-MS" unitAccession="UO:0000031" unitName="minute"/>
              <scanWindowList count="1">
                <scanWindow>
                  <cvParam cvRef="PSI-MS" accession="MS:1000501" name="scan window lower limit" value="100.000000" unitCvRef="PSI-MS" unitAccession="MS:1000040" unitName="m/z"/>
                  <cvParam cvRef="PSI-MS" accession="MS:1000500" name="scan window upper limit" value="1600.000000" unitCvRef="PSI-MS" unitAccession="MS:1000040" unitName="m/z"/>
                </scanWindow>
              </scanWindowList>
            </scan>
          </scanList>

LMK what you think (I will try to find some public hela dad dataset to use as an example)