FASTA format - Githubissues

TANIAKMONS commented 1 month ago

Hello,

I have an issue with the FASTA format. The FASTA file was generated from Illumina sequencing and annotated with KREGG. We first tried it without Uniprot annotation and it did not work. Will it work if the FASTA contains several kinds of annotation, including the Uniprot one? It seems that we can't just use the Uniprot FASTA format.

Thanks in advance, TK

vdemichev commented 1 month ago

Hi TK,

Protein sequence IDs should be read correctly from any FASTA. All other information you can always pull out of the FASTA using some FASTA-reading R package, to annotate DIA-NN's output report.

"We have tried a first time without Uniprot annotation and it did not."

How did it manifest?

Best, Vadim

saradufour commented 1 month ago

Hi,

I'm having the same issue in a library-free search. A FASTA header looks like this, for example:

>P62874,Q3TQ70|TX=10090 OS=Mouse GN=ENSMUSG00000029064.16,Gnb1 TA=NM_001160016.1,ENSMUST00000105616.10,XM_017319977.2,NM_001160017.1,ENSMUST00000030940.14,ENSMUST00000176637.2,ENSMUST00000165335.8,NM_008142.4 PA=ENSMUSP00000030940.8,NP_032168.1,ENSMUSP00000135091.2,XP_017175466.1,ENSMUSP00000101241.4,NP_001153488.1,ENSMUSP00000130123.2,NP001153489.1,P62874,Q3TQ70

(FASTA file from OpenProt (microprotein identification) with > 500000 entries) and the output in the log is the following:

[0:48] Processing FASTA
[1:35] Assembling elution groups
[2:47] 23495123 precursors generated
[2:47] Gene names missing for some isoforms
[2:47] Library contains 1 proteins, and 1 genes
[2:51] Encoding peptides for spectra and RTs prediction

Any idea how to fix this issue?

Thanks! Best, Sara

vdemichev commented 2 weeks ago

Hi Sara,

DIA-NN will not correctly extract protein names from this. It should get the IDs OK though, i.e. you can annotate DIA-NN output using some FASTA-reading R package.

Best, Vadim
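
The annotation step suggested above can also be done outside R. The following is a minimal Python sketch, assuming headers shaped like the OpenProt example in this thread (comma-separated IDs before the `|`, then two-letter `KEY=value` fields). The function names `parse_header`, `build_annotation_map` and `genes_for` are hypothetical, and the `;` separator for protein IDs in the report is an assumption to check against your own output.

```python
import re

# Matches two-letter uppercase keys such as TX=, OS=, GN=, TA=, PA=
# and captures everything up to the next such key or the end of the line.
FIELD = re.compile(r"\b([A-Z]{2})=(.*?)(?=\s+[A-Z]{2}=|$)")


def parse_header(header):
    """Split one FASTA header line into (list of IDs, dict of KEY=value fields)."""
    line = header.lstrip(">").strip()
    id_part, _, rest = line.partition("|")
    ids = [x for x in id_part.split(",") if x]
    fields = {m.group(1): m.group(2).strip() for m in FIELD.finditer(rest)}
    return ids, fields


def build_annotation_map(fasta_lines):
    """Map every protein ID found in header lines to its gene list and organism."""
    annotations = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        ids, fields = parse_header(line)
        genes = fields["GN"].split(",") if "GN" in fields else []
        for pid in ids:
            annotations[pid] = {"genes": genes, "organism": fields.get("OS", "")}
    return annotations


def genes_for(protein_ids, annotations, sep=";"):
    """Collect gene names for a separator-joined ID string (separator is an assumption)."""
    genes = []
    for pid in protein_ids.split(sep):
        for gene in annotations.get(pid, {}).get("genes", []):
            if gene not in genes:
                genes.append(gene)
    return genes


# Shortened version of the header posted above, followed by a dummy sequence line.
example = [
    ">P62874,Q3TQ70|TX=10090 OS=Mouse GN=ENSMUSG00000029064.16,Gnb1 PA=NP_032168.1",
    "MSELDQLRQE",
]
annotations = build_annotation_map(example)
print(genes_for("P62874;Q3TQ70", annotations))
```

The resulting dictionary can then be joined onto the report by protein ID with whatever table library you already use.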

TANIAKMONS commented 5 days ago

Hi Vadim,

I had the same result as Sara (Library contains 1 proteins, and 1 genes). We have written a script to incorporate Uniprot annotations into the FASTA, and we now use DIA-NN 1.9. This is the result we get:

10 files will be processed
[0:00] Loading FASTA C:\Tania\output_proteinpilot2.fasta
[2:07] Processing FASTA
[4:11] Assembling elution groups
[6:57] 59894740 precursors generated
[6:58] Gene names missing for some isoforms
[6:58] Library contains 717220 proteins, and 1 genes
[7:09] Encoding peptides for spectra and RTs prediction
[9:53] Predicting spectra and IMs
[370:52] Predicting RTs
[409:47] Decoding predicted spectra and IMs
[411:19] Decoding RTs
[412:01] Saving the library to C:\Tania\DIA-NN\1.9\report.predicted.speclib
[415:57] Initialising library

First pass: generating a spectral library from DIA data

[418:51] File #1/10
[418:51] Loading run C:\Tania\PSF21h.wiff
[421:59] 59872940 library precursors are potentially detectable
[423:20] Processing.

Since processing takes very long, we will run it on a more powerful server, which runs Linux. Is the command line the same as with DIA-NN 1.8?

Thanks, Kind Regards,

TK
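
Both logs in this thread show the symptom on the "Library contains ... proteins, and ... genes" line. A small Python sketch like the one below could flag it automatically before a long search is started. It relies only on the wording of that log line as posted here. The helper names and the "1 or fewer" heuristic are assumptions for illustration, not anything DIA-NN provides.

```python
import re

# Wording taken from the log lines quoted in this thread.
LIBRARY_LINE = re.compile(r"Library contains (\d+) proteins, and (\d+) genes")


def library_counts(log_text):
    """Return (proteins, genes) from the first matching log line, or None."""
    match = LIBRARY_LINE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def looks_misparsed(proteins, genes):
    """Heuristic (assumption): a large FASTA should not collapse to one protein or gene."""
    return proteins <= 1 or genes <= 1


for line in (
    "[2:47] Library contains 1 proteins, and 1 genes",
    "[6:58] Library contains 717220 proteins, and 1 genes",
):
    proteins, genes = library_counts(line)
    print(proteins, genes, looks_misparsed(proteins, genes))
```

The regular expression searches the whole text, so it works whether the log is stored one entry per line or as a single run of text.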

vdemichev commented 5 days ago

Hi TK,

I would suggest trying the recommended settings first, which should result in a much smaller predicted library and search space.

No, I don't recommend using 1.8.1. If you do, please make sure to use the predicted library generated by 1.9.

Best, Vadim
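
To see why digestion settings change the size of the search space, here is a rough Python sketch of an in-silico digest. It is a simplified illustration, not DIA-NN's algorithm: it cleaves after every K or R (no proline exception), counts distinct peptides only (no charge states or modifications), and the default values of `missed_cleavages`, `min_len` and `max_len` are illustrative rather than DIA-NN's recommended settings.

```python
def cleavage_sites(seq):
    """Cut positions: sequence start, after each internal K or R, and sequence end."""
    sites = [0]
    for i, residue in enumerate(seq):
        if residue in "KR" and i + 1 < len(seq):
            sites.append(i + 1)
    sites.append(len(seq))
    return sites


def digest(seq, missed_cleavages=1, min_len=7, max_len=30):
    """Return the set of peptides allowed by the given (illustrative) settings."""
    sites = cleavage_sites(seq)
    peptides = set()
    for start in range(len(sites) - 1):
        for skipped in range(missed_cleavages + 1):
            end = start + 1 + skipped
            if end >= len(sites):
                break  # ran past the end of the protein
            peptide = seq[sites[start]:sites[end]]
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


# Toy sequence with two internal cleavage sites.
toy = "AAAAAAAKGGGGGGGRLLLLLLL"
for mc in (0, 1, 2):
    print(mc, len(digest(toy, missed_cleavages=mc)))
```

Each extra allowed missed cleavage adds peptides for every protein, so on a FASTA with several hundred thousand entries the library grows quickly.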

vdemichev / DiaNN

FASTA format #1029