twopin / CAMP

predicting peptide-protein interactions

How to download the file uniprot2seq from UniProt Website #14

Closed milkboylyf closed 1 year ago

milkboylyf commented 2 years ago

Hi, the function load_uni_seq in data_prepare/step1_pdb_process.py requires an input called uniprot2seq_file, described as "uniprot2seq from UniProt Website" (a tab-separated file with fields including Uniprot_id, Uniprot Sequence, Protein_name, Protein_families). How do I download this file from the UniProt website, and can you provide a download link for it? Thanks.

milkboylyf commented 2 years ago

Hi, twopin. I am also having trouble with the format of the file crawl_results.csv used in data_prepare/query-mapping.py. Can you provide template data for this file? Thanks.

twopin commented 2 years ago

Hi, you can download the sequence files from https://www.uniprot.org/downloads.

twopin commented 2 years ago

Sorry, I did not save the intermediate file, but you can use your own data. The important part of the script begins at line 49. You can adjust the script to match your own data format.

milkboylyf commented 2 years ago

Hi, twopin, thanks for your reply. When I visit https://www.uniprot.org/downloads, the page looks as shown below. Which link should I click to get uniprot2seq_file? After downloading the file, do I need to clean the data further to get the columns Uniprot_id, Uniprot Sequence, Protein_name, Protein_families?

[image]

[image]

zhouruikang1024 commented 2 years ago

Hi, I also don't know the format of crawl_results.csv. Have you solved this problem? Could you please provide the file or a sample of its format?

rocke2020 commented 2 years ago

Dear twopin, could you tell us exactly how to get uniprot2seq_file from https://www.uniprot.org/help/downloads? Which link on that page is the right one? Thanks in advance!

rocke2020 commented 2 years ago

@milkboylyf, did you eventually find the download link? If so, could you share it with us? Thanks!

Yiqiu-Zhang commented 1 year ago

I did not find any file that fits the description, but I found a file that contains the UniProt IDs and the protein families: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/similar

twopin commented 1 year ago

I've uploaded the exact data I downloaded for this project. The file was downloaded directly from UniProt by simply selecting all the FASTA sequences. You can use my copy directly or download the latest version.
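For readers starting from a FASTA download, here is a minimal sketch of turning it into an accession-to-sequence mapping. It assumes the standard UniProt header layout (`>db|UniqueIdentifier|EntryName ProteinName OS=...`). The function name `read_uniprot_fasta` and the sample records are hypothetical and are not part of the CAMP scripts.

```python
import io
from typing import Dict, Iterable, Tuple


def read_uniprot_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each UniProt accession to (protein_name, sequence)."""
    records: Dict[str, Tuple[str, str]] = {}
    accession, name, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the record that just ended before starting a new one.
            if accession is not None:
                records[accession] = (name, "".join(chunks))
            # Assumed header: >db|UniqueIdentifier|EntryName ProteinName OS=...
            ids, _, description = line[1:].partition(" ")
            parts = ids.split("|")
            accession = parts[1] if len(parts) >= 3 else ids
            name = description.split(" OS=", 1)[0]
            chunks = []
        else:
            chunks.append(line)
    if accession is not None:
        records[accession] = (name, "".join(chunks))
    return records


# Made-up records for illustration only, not real UniProt entries.
SAMPLE_FASTA = """\
>sp|X00001|DEMO1_HUMAN Demo protein one OS=Homo sapiens OX=9606 GN=DEMO1 PE=1 SV=1
MKTAYIAK
QRQISFVK
>tr|X00002|X00002_HUMAN Demo protein two OS=Homo sapiens OX=9606 PE=4 SV=1
MSHHWGYG
"""

records = read_uniprot_fasta(io.StringIO(SAMPLE_FASTA))
```

With a real file, pass an open file handle instead of the `io.StringIO` object. Note that a FASTA header carries the protein name but no protein family, so the family column has to come from another source.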

twopin commented 1 year ago

crawl_results.csv is the output file of the crawling script.

twopin commented 1 year ago

The UniProt website has been updated since 2020; here is how to download the file now: [image] [image]

rocke2020 commented 1 year ago

@twopin, thanks for your reply. Did you get Protein_families from https://www.uniprot.org/uniprotkb?query=reviewed:false or from https://www.uniprot.org/help/downloads?

I think the downloaded FASTA file only has ProteinName, which is not the protein family name, right? We opened this issue to ask where to get the protein family for a UniProt ID. Or do you treat ProteinName as the protein family name and filter out an entry if its ProteinName contains "MHC"? We found protein family names at the link below: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/similar

db|UniqueIdentifier|EntryName ProteinName OS=OrganismName OX=OrganismIdentifier [GN=GeneName ]PE=ProteinExistence SV=SequenceVersion

BTW, could you tell us whether the paper uses the reviewed UniProt (sp) sequences or the unreviewed ones? Your picture shows the unreviewed sequences, while your uploaded file and code seem to use the reviewed sp sequences. Thanks for sharing!

twopin commented 1 year ago

Oh, actually you don't need the protein family information for CAMP, but if you do need it for downstream analysis, you can customize the columns on UniProt when downloading the sequences.
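The same column selection can probably be scripted against the UniProt REST API instead of the web page. The sketch below only builds the request URL and does not download anything. The endpoint and the field names (`accession`, `sequence`, `protein_name`, `protein_families`) are assumptions about how the website columns map to API return fields and should be checked against the current UniProt documentation. The query `reviewed:true` is just a placeholder, since the thread does not settle which set was used.

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProtKB streaming endpoint.
UNIPROT_STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_uniprot_tsv_url(query, fields):
    """Return a URL requesting the given columns as a tab-separated file."""
    params = {"query": query, "format": "tsv", "fields": ",".join(fields)}
    return f"{UNIPROT_STREAM_URL}?{urlencode(params)}"


# Columns matching the four fields named in the original question
# (field names are assumptions, see the note above).
url = build_uniprot_tsv_url(
    "reviewed:true",
    ["accession", "sequence", "protein_name", "protein_families"],
)
print(url)
```

The printed URL can be opened in a browser or fetched with any HTTP client. Keeping the URL construction separate from the download makes it easy to inspect what is being requested first.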

[image] [image]

twopin commented 1 year ago

@rocke2020, please see my reply below.

rocke2020 commented 1 year ago

@twopin, in your paper and your shared code, you filtered out protein sequences that belong to the MHC protein family. (●'◡'●)

twopin commented 1 year ago

Oh, you mean the filtering... I thought you were talking about the scripts in this repository. You can get the protein family from UniProt. First, prepare your list of UniProt IDs, click "ID mapping", load your list, and click "Map IDs". When the job has finished, click the job ID and you will see a large table. Then click "Customize columns" (shown in the attached figures) and select the protein families column. You will get what you want.
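Once the table with the protein families column has been exported as a tab-separated file, the MHC filter discussed in this thread can be sketched as follows. The column headers (`Entry`, `Protein families`) and the sample rows are assumptions for illustration, so adjust them to match the actual export. This is not the CAMP authors' filtering code.

```python
import csv
import io


def drop_mhc_entries(tsv_text, family_column="Protein families", id_column="Entry"):
    """Return the IDs whose protein family annotation does not mention MHC."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        # Entries without a family annotation are kept.
        family = row.get(family_column) or ""
        if "MHC" in family.upper():
            continue
        kept.append(row[id_column])
    return kept


# Made-up rows for illustration only, not real UniProt entries.
SAMPLE_TSV = (
    "Entry\tProtein families\n"
    "X00001\tMHC class I family\n"
    "X00002\tProtein kinase superfamily\n"
    "X00003\t\n"
)

kept_ids = drop_mhc_entries(SAMPLE_TSV)
```

A plain substring match is the simplest choice here. If other family names happen to contain the letters "MHC", a stricter pattern may be needed.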