snap-stanford / SATURN

MIT License

TypeError: descriptor 'union' of 'set' object needs an argument when generate protein embedding #76

Closed zhuojiuqingyun closed 4 days ago

zhuojiuqingyun commented 1 week ago

Dear authors, thank you very much for making this great tool. I got the following error while running Generate Protein Embeddings.ipynb with my own protein FASTA.

!python map_gene_symbol_to_protein_ids.py \
    --fasta_path ./data/{NAME}.fa \
    --save_path ./data/{NAME}.gene_symbol_to_protein_ID.json

!python convert_protein_embeddings_to_gene_embeddings.py \
    --embedding_dir ./data/{NAME}.clean.fa_esm2 \
    --gene_symbol_to_protein_ids_path ./data/{NAME}.gene_symbol_to_protein_ID.json \
    --embedding_model ESM2 \
    --save_path ./data/{NAME}.gene_symbol_to_embedding_ESM2.pt
100%|█████████████████████████████████| 28010/28010 [00:00<00:00, 673321.35it/s]
28010
Traceback (most recent call last):
  File "map_gene_symbol_to_protein_ids.py", line 63, in <module>
    map_gene_symbol_to_protein_ids(Args().parse_args())
  File "map_gene_symbol_to_protein_ids.py", line 46, in map_gene_symbol_to_protein_ids
    all_protein_ids = set.union(*[protein_ids for protein_ids in gene_symbol_to_protein_ids.values()])
TypeError: descriptor 'union' of 'set' object needs an argument
Traceback (most recent call last):
  File "convert_protein_embeddings_to_gene_embeddings.py", line 86, in <module>
    convert_protein_embeddings_to_gene_embeddings(Args().parse_args())
  File "convert_protein_embeddings_to_gene_embeddings.py", line 27, in convert_protein_embeddings_to_gene_embeddings
    with open(args.gene_symbol_to_protein_ids_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/xenia.proteins.gene_symbol_to_protein_ID.json'

I didn't encounter this error when running the Xenopus_tropicalis example. Could you help me with this issue?

Yours sincerely, Ruijie

Yanay1 commented 4 days ago

Please double check and make sure the path is correct:

FileNotFoundError: [Errno 2] No such file or directory: 'data/xenia.proteins.gene_symbol_to_protein_ID.json'

The error is due to no file existing at that location.

zhuojiuqingyun commented 3 days ago

Thanks for your reply. In fact, the FileNotFoundError is a consequence of the first error, which prevented data/xenia.proteins.gene_symbol_to_protein_ID.json from being generated:

TypeError: descriptor 'union' of 'set' object needs an argument

I think the reason is that I couldn't find the protein FASTA on Ensembl, so I downloaded it from NCBI. As a result, the description lines in my protein FASTA don't contain the gene symbol and protein ID, which caused the error above. How can I generate protein embeddings from a protein FASTA that was not downloaded from Ensembl? Could you give me some instructions?

Thanks a lot!

Ruijie
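For readers hitting the same traceback, the TypeError is what Python raises when `set.union` is called with no arguments at all, which happens when the unpacked list is empty. The sketch below reproduces that under the assumption (consistent with the explanation above, but not confirmed by the authors) that no FASTA header yielded a gene symbol, so `gene_symbol_to_protein_ids` was empty. The helper `union_all` is hypothetical and not part of SATURN.

```python
def union_all(sets):
    """Union an iterable of sets; returns an empty set if there are none.

    Hypothetical helper for illustration, not part of SATURN.
    """
    # Calling union on an empty set *instance* works even with zero arguments.
    return set().union(*sets)


# Assumption: this is what the mapping looks like when no header could be parsed.
gene_symbol_to_protein_ids = {}

raised_type_error = False
try:
    # Same pattern as line 46 of the traceback: with an empty dict this
    # becomes set.union() with no arguments, which raises TypeError.
    set.union(*[ids for ids in gene_symbol_to_protein_ids.values()])
except TypeError:
    raised_type_error = True

all_protein_ids = union_all(gene_symbol_to_protein_ids.values())
print(raised_type_error, all_protein_ids)
```

Guarding the union this way only avoids the crash. An empty mapping still means no gene symbols were extracted from the FASTA headers, so the header format remains the thing to fix.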

zhuojiuqingyun commented 3 days ago

Moreover, I want to use SATURN for cross-species annotation of Xenia, but we don't know the exact gene_symbol for each protein_id. Will the embeddings created this way influence the training of SATURN? For example, a description line from my protein FASTA and part of my GTF are as follows.

>Xe_029168-T1 Xe_029168
MSSTEEEVEFDIEYIATEVQPYMFEPLASSNNVETDEDLSSSSSTDSSSDEYTHRIGNTNWCECGHCVAMTTGRESICCHEEPKTDPKIHGDHLCIT

HiC_scaffold_1 GenBank CDS 333245 333499 . - 0 transcript_id "Xe_000054-T1"; gene_id "Xe_000054";

Are the exact gene_symbols necessary, or can they be replaced by those serial numbers? Could you give me some advice? Thank you very much!
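As a rough illustration of the question above, the sketch below builds a gene-to-protein mapping directly from headers of the form `>Xe_029168-T1 Xe_029168`, using the gene ID in place of a gene symbol. It assumes the first header field is the protein/transcript ID and the second is the gene ID, and that the downstream script expects a JSON object mapping each gene name to a list of protein IDs; neither assumption is confirmed in this thread. The function name `map_gene_ids_to_protein_ids` and the `-T2` record are hypothetical.

```python
import json
from collections import defaultdict


def map_gene_ids_to_protein_ids(fasta_lines):
    """Map gene ID -> list of protein IDs from '>protein_id gene_id' headers.

    Hypothetical helper; the header layout is an assumption based on the
    example record quoted in this thread.
    """
    mapping = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if len(fields) < 2:
            continue  # header has no gene ID field; skip it
        protein_id, gene_id = fields[0], fields[1]
        if protein_id not in mapping[gene_id]:
            mapping[gene_id].append(protein_id)
    return dict(mapping)


example_fasta = [
    ">Xe_029168-T1 Xe_029168",           # header quoted in the thread
    "MSSTEEEVEFDIEYIATEVQPYMFEPLASSNN",  # sequence truncated for brevity
    ">Xe_029168-T2 Xe_029168",           # hypothetical second isoform
    "MSSTEEEVEFDIEYIATEVQPYMFEPLASSNN",
]

gene_to_proteins = map_gene_ids_to_protein_ids(example_fasta)
print(json.dumps(gene_to_proteins, indent=2))
```

Whether SATURN can work with such placeholder IDs instead of real gene symbols is the open question for the authors; the sketch only shows how the mapping file could be produced.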