snijderlab / stitch

Template-based assembly of proteomics short reads for de novo antibody sequencing and repertoire profiling
MIT License

Reformatting Datasets CSV #205

Closed kostrouc closed 1 year ago

kostrouc commented 1 year ago

Hi,

I have a custom de novo sequencing algorithm that generated predicted peptide sequences. Could you tell me which required input columns I am missing for stitch to work properly with a polyclonal antibody dataset from rabbit?

I have attached the example csv input file along with the edited polyclonal batchfile.

Below is the error I receive when I run the script in the command prompt:

Y:\Katie\stitch>.\stitch.exe batchfiles\RabbitPolyclonal.txt
>                                                                                                          |   0%  35 ms
>> Error: Object reference not set to an instance of an object.
 --> Stacktrace:
| Stitch.InputNameSpace.ParseHelper.ParseTemplateMatching(NameFilter nameFilter, KeyValue key)
| Stitch.ParseCommandFile.Batch(String path)
| Stitch.ToRunWithCommandLine.RunBatchFile(String filename, RunVariables runVariables)
| Stitch.ToRunWithCommandLine.Main()
Version: 1.3.0+b7279d1

rabbit_test.csv RabbitPolyclonal.txt

douweschulte commented 1 year ago

Cool work. The error occurs because you mistyped the path to the template (cunicus instead of cuniculus). Stitch should not crash like this, so I will add a nicer error message for the case where a template path is mistyped. Once the path is corrected, new errors will appear complaining that the file is not valid for this version of Peaks, because the Input section of your batchfile declares the file you made to be a Peaks X+ file. Stitch can load multiple types of input files, as described in the manual. Below is an example of a Peaks X+ file (the 200305_HER_test_04_DENOVO.csv included with the program).

Fraction,Source File,Feature,Peptide,Scan,Tag Length,Denovo Score,ALC (%),length,m/z,z,RT,Predict RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
10,20191211_F1_Ag5_peng0013_SA_her_Asp_N.raw,F10:3434,DYEKHKVYAC(+58.01),F10:3629,10,99,99,10,438.5332,3,19.91,-,2.3176E6,1312.5757,1.6,Carboxymethyl,100 100 100 100 100 100 100 100 100 100,DYEKHKVYAC(+58.01),ETHCD
4,20191211_F1_Ag5_peng0013_SA_her_Ela.raw,F4:4797,SGFGGLKN(+.98)TYLHW,F4:9505,13,99,99,13,494.2459,3,52.43,-,2.4924E7,1479.7146,0.9,Deamidation (NQ),100 100 100 100 100 100 100 100 100 100 100 100 100,SGFGGLKN(+.98)TYLHW,HCD
3,20191211_F1_Ag5_peng0013_SA_her_thermo.raw,F3:12703,LSC(+58.01)AASGFNLKDTY,F3:7983,14,99,99,14,774.3562,2,43.80,-,7.2888E7,1546.6973,0.4,Carboxymethyl,99 100 100 100 100 100 100 100 100 100 100 100 100 100,LSC(+58.01)AASGFNLKDTY,HCD

Thanks to your question I see that the manual has no good examples of the different input file formats, so I will add them. For now, creating a fasta file should be a fast way forward. In the future, if your program produces its own file format, I could build in support for it directly. For that I would need an example file from your program.
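A minimal sketch of the fasta route in Python, not part of Stitch: it reads the peptide column from a CSV and emits one fasta record per peptide. The column name `Peptide`, the `read_N` identifiers, and the choice to strip parenthesised mass shifts such as `(+58.01)` are assumptions for illustration, so adjust them to your own file.

```python
import csv
import io
import re

# Matches parenthesised mass shifts such as "(+58.01)" or "(+.98)".
MODIFICATION = re.compile(r"\([^)]*\)")


def peptides_to_fasta(csv_text, column="Peptide"):
    """Return fasta text with one record per non-empty peptide in `column`."""
    records = []
    rows = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(rows, start=1):
        # Assumption: the fasta should hold bare amino acid letters only.
        sequence = MODIFICATION.sub("", row[column]).strip()
        if sequence:
            records.append(f">read_{index}\n{sequence}")
    return "\n".join(records) + "\n"


# Two peptides taken from the Peaks X+ example rows above.
example = """Peptide,Scan
DYEKHKVYAC(+58.01),F10:3629
SGFGGLKN(+.98)TYLHW,F4:9505
"""
fasta = peptides_to_fasta(example)
print(fasta)
```

Writing the returned text to a `.fasta` file gives a reads file that keeps only the sequences and drops all other metadata.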

douweschulte commented 1 year ago

I added examples of the file formats to the documentation as promised above; the link from the previous message is still valid. I also found the cause of the crash you reported. Below is the error message you would get with the nightly version of stitch (the version that I am currently working on).

There was 1 error while parsing.

Error: Could not open file
   ╭── ~\RabbitPolyclonal.txt:32:30
   │
30 │         Heavy Chain->
31 │             Segment->
32 │                 Path      : ../templates/Oryctolagus_cunicus_IGHV.fasta
   ·                             ───────────────────────────────────────────
33 │                 Name      : IGHV
34 │                 Identifier: ^(([a-zA-Z]+\d*)[\w-]*)
   ╵
note: The specified file could not be found.
help: Did you mean 'Oryctolagus_cuniculus_IGHC_uniprot.fasta'?
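
The kind of "Did you mean" hint in the message above can be approximated with fuzzy string matching. The sketch below uses Python's `difflib` and is only an illustration, not how Stitch implements it. Apart from the uniprot file named in the error message, the candidate file names are hypothetical.

```python
import difflib


def suggest_template(requested, available, limit=3):
    """Return up to `limit` file names that closely resemble `requested`."""
    return difflib.get_close_matches(requested, available, n=limit, cutoff=0.6)


# Hypothetical directory listing, for illustration only.
available = [
    "Oryctolagus_cuniculus_IGHV.fasta",
    "Oryctolagus_cuniculus_IGHC_uniprot.fasta",
    "Homo_sapiens_IGHV.fasta",
]
suggestions = suggest_template("Oryctolagus_cunicus_IGHV.fasta", available)
print(suggestions)
```

The best match comes first in the returned list, so checking the first suggestion against the templates folder is usually enough to spot a typo.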

kostrouc commented 1 year ago

Is the ALC (%) column required for the recombination step?

I noticed that a plain text file with one peptide sequence per line is enough to produce template matching and recombination results. See the attached input reads and batch file.

r1.txt rabbit_reads.txt

douweschulte commented 1 year ago

No, the ALC column is not necessary; any of the input file formats will work fine. The other formats exist only to make the software easier to use while preserving all of the metadata that the de novo peptide sequencing software generated.
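For a custom de novo tool, a plain reads file with one peptide per line is easy to generate. A rough Python sketch, assuming reads should contain only the 20 standard amino acid letters (an assumption of this example, not a documented Stitch requirement); `to_simple_reads` is a hypothetical helper.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def to_simple_reads(sequences):
    """Return (text, skipped): one clean peptide per line, plus rejected entries."""
    lines = []
    skipped = []
    for sequence in sequences:
        cleaned = sequence.strip().upper()
        if not cleaned:
            continue  # ignore blank entries
        if set(cleaned) <= AMINO_ACIDS:
            lines.append(cleaned)
        else:
            # Anything with modifications or other symbols is reported, not written.
            skipped.append(sequence)
    return "\n".join(lines) + "\n", skipped


text, skipped = to_simple_reads(
    ["DYEKHKVYAC", " sgfgglkntylhw ", "", "LSC(+58.01)AASGFNLKDTY"]
)
print(text)
print(skipped)
```

Returning the skipped entries separately makes it visible when reads are dropped, for example peptides that still carry modification annotations.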

kostrouc commented 1 year ago

Thank you! This was very helpful!

douweschulte commented 1 year ago

You are welcome