volpato30 / DeepNovoV2

PyTorch implementation of DeepNovoV2, a state-of-the-art de novo peptide sequencing model.

denovo_search denovo_input_feature_file #2

Open cguetot opened 4 years ago

cguetot commented 4 years ago

Hi,

If I want to make predictions for a new mgf file, do I have to leave an empty cell for the 'seq' column in its feature file?

I noted that the headers of the feature files are defined as follows: "spec_group_id","m/z","z","rt_mean","seq","scans","profile","feature area"

Separately, do you have any setting recommendations (deepnovo_config.py) for data coming from a QExactive-HF?

best,

Carlos

volpato30 commented 4 years ago

Hi Carlos,

Yes, you need to keep the seq column, but you can put anything in that cell (seq won't be used when doing de novo). The reader searches for "seq" in the header, so deleting that column should raise an error.
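
As a minimal sketch of the point above (not taken from the DeepNovoV2 repository), a feature file for de novo prediction could be written with Python's csv module so that the seq column is present but carries only a placeholder. The header is the one quoted in the question. The row values, the file name features.csv, and the helper write_feature_file are made up for illustration, and the expected contents of the scans, profile, and feature area cells are assumptions.

```python
import csv

# Header as quoted in this thread; the reader looks up "seq" by name.
HEADER = ["spec_group_id", "m/z", "z", "rt_mean", "seq",
          "scans", "profile", "feature area"]


def write_feature_file(path, rows, placeholder=""):
    """Write rows to a feature CSV, forcing a placeholder into 'seq'.

    `rows` is a list of dicts keyed by the header names; 'seq' may be missing.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=HEADER)
        writer.writeheader()
        for row in rows:
            record = {name: row.get(name, "") for name in HEADER}
            # Not used for de novo, but the column itself has to exist.
            record["seq"] = placeholder
            writer.writerow(record)


# Hypothetical values for a single spectrum (assumptions, not real data).
example_rows = [{"spec_group_id": "F1:1001", "m/z": 500.25, "z": 2,
                 "rt_mean": 30.5, "scans": "F1:1001",
                 "profile": "", "feature area": 0}]
write_feature_file("features.csv", example_rows)
```

Filling the column from one place keeps the header intact even when the input rows have no seq key at all.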

Let me know if you have any problems when running it.

Best,

Rui

cguetot commented 4 years ago

Hi Rui,

I modified my comment, so you did not get the chance to read my second question:

Do you have any setting recommendations (deepnovo_config.py) for data coming from a QExactive-HF, both for training and for de novo search?

Carlos

volpato30 commented 4 years ago

Hi Carlos,

I don't think you need to change parameters for Q Exactive data. Just make sure your training data has similar properties (enzyme, instrument, fragmentation method) to the data you want to perform de novo sequencing on.

Rui

cguetot commented 3 years ago

Can I use the same knapsack file from DeepNovo, or are they different?

volpato30 commented 3 years ago

They are the same. But be careful about the PTM settings (the AAs included in vocab_reverse). One knapsack file corresponds to a specific set of PTMs and a specific MZ_MAX. I believe the original DeepNovo knapsack was generated with C(Cam), M(Oxidation), NQ(Deamidation) and an MZ_MAX of 3000.
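
To illustrate why a knapsack file is tied to one residue set and one MZ_MAX, here is a toy sketch of a knapsack-style reachability table. It is not DeepNovoV2's implementation: the function build_knapsack, the three-residue vocabulary, the rounded integer masses, and the extra residue X are assumptions chosen to keep the example small.

```python
def build_knapsack(residue_masses, max_mass):
    """Return (names, table), where table[i][m] is True when integer mass m
    can be written as a sum of residue masses that uses residue i at least once.
    """
    names = list(residue_masses)
    reachable = [False] * (max_mass + 1)
    reachable[0] = True  # the empty peptide
    table = [[False] * (max_mass + 1) for _ in names]
    for m in range(1, max_mass + 1):
        for i, name in enumerate(names):
            rest = m - residue_masses[name]
            # m is valid for residue i if the remainder is itself reachable.
            if rest >= 0 and reachable[rest]:
                table[i][m] = True
        reachable[m] = any(row[m] for row in table)
    return names, table


# Hypothetical vocabulary with masses rounded to whole daltons.
base = {"G": 57, "A": 71, "S": 87}
names, table = build_knapsack(base, max_mass=200)

# Adding a residue (a made-up entry X) adds a row, and a larger max_mass
# adds columns, so neither table can stand in for the other.
with_mod = dict(base, X=103)
names_mod, table_mod = build_knapsack(with_mod, max_mass=300)

print(len(table), len(table[0]))          # rows = residues, columns = masses
print(len(table_mod), len(table_mod[0]))
```

The table dimensions come directly from the vocabulary and the mass limit, which is why a file built under one configuration does not match another.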

cguetot commented 3 years ago

How can I build a custom knapsack for DeepNovoV2 with, for example, an MZ_MAX of 4000?

volpato30 commented 3 years ago

Change MZ_MAX to 4000 in the config file, then run make denovo. When the program detects no knapsack.npy file in the current folder, it will start building a new one with the configuration in the deepnovo_config.py file.
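
The reply implies that an existing knapsack.npy is reused as-is, so a file built with a different MZ_MAX or residue set would presumably need to be moved out of the working folder before running make denovo. The sketch below shows one way to do that. The helper name retire_knapsack and the backup naming scheme are assumptions, not part of DeepNovoV2.

```python
from pathlib import Path


def retire_knapsack(folder=".", filename="knapsack.npy", suffix=".bak"):
    """Rename an existing knapsack file so the next run has to rebuild it.

    Returns the backup path, or None when there was nothing to rename.
    """
    current = Path(folder) / filename
    if not current.exists():
        return None
    backup = current.with_name(current.name + suffix)
    counter = 1
    while backup.exists():  # never overwrite an earlier backup
        backup = current.with_name(f"{current.name}{suffix}{counter}")
        counter += 1
    current.rename(backup)
    return backup


if __name__ == "__main__":
    moved = retire_knapsack()
    if moved is None:
        print("no knapsack.npy found; a new one should be built on the next run")
    else:
        print("old knapsack moved to", moved)
```

Renaming rather than deleting keeps the old table available in case the previous settings are needed again.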

cguetot commented 3 years ago

That's perfect.

Thanks,

Carlos

cguetot commented 3 years ago

Hi again,

several questions:

1) How can I increase the training, validation and test sizes? Context: I see that variables like train_stack_size, valid_stack_size and test_stack_size are no longer used in this code, unlike in the old TensorFlow version.

2) I also see a variable called batch_size with a lower value (32) than in the original code (128). How does it affect the training process?

3) If I increase "num_workers", will it speed up the calculations?

4) Is it possible to get the top n best candidates for each scan?

thanks in advance,

Carlos

volpato30 commented 3 years ago
  1. Do you mean batch size? Batch size is configured with the batch_size variable in the config file. The number of data points depends entirely on your input file.
  2. For training you should use a lower value; 128 is what I used for de novo. I usually train the model with a batch size of 16 or 32, and I have not observed a significant difference in the final accuracy of the model.
  3. num_workers controls the number of CPU threads that provide (i.e. preprocess) training data to the GPU. If you observe that your GPU is not fully utilized during training, then increasing it might help. Otherwise there is no need to increase the value.
  4. Yes, you can. The current beam search retains the top 5 candidates (also configurable in the config file). You just need to slightly modify the denovo.py and writer.py files to output the top 5 instead of the top 1.
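
For point 4, the change amounts to writing out every candidate the beam search keeps rather than only the best one. The sketch below shows that selection step in isolation. It does not reproduce denovo.py or writer.py, and the Candidate type, the score convention (higher is better), and the sample peptides are assumptions.

```python
import heapq
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    sequence: str
    score: float  # assumed convention: higher means more confident


def top_n_per_scan(results: Dict[str, List[Candidate]], n: int = 5):
    """Keep the n best-scoring candidates for each scan, best first."""
    return {scan: heapq.nlargest(n, cands, key=lambda c: c.score)
            for scan, cands in results.items()}


def format_rows(ranked):
    """Flatten to (scan, rank, sequence, score) rows for a tabular writer."""
    return [(scan, rank, cand.sequence, cand.score)
            for scan, cands in ranked.items()
            for rank, cand in enumerate(cands, start=1)]


# Hypothetical beam output for one scan.
beam = {"F1:1001": [Candidate("PEPTLDE", -1.2),
                    Candidate("PEPTIDE", -0.4),
                    Candidate("PEPTLED", -2.0)]}
for row in format_rows(top_n_per_scan(beam, n=2)):
    print(row)
```

Emitting one row per candidate with an explicit rank column keeps the output flat, so downstream tools can still filter for rank 1 to recover the usual top-1 result.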