mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

translate.py documentation #5

Closed lhallee closed 7 months ago

lhallee commented 7 months ago

Hello!

I just noticed a couple of things while using translate and wanted to bring some attention to them in case they are issues.

1. is_3Di is assigned twice

    is_3Di = False if int(args.is_3Di) == 0 else True

    split_char = args.split_char
    id_field = args.id

    half_precision = False if int(args.half) == 0 else True
    is_3Di = False if int(args.is_3Di) == 0 else True


2. The message printed for is_3Di seems to have its logic reversed: it says 0 means the input is 3Di and 1 means the input is AA, but the code treats a true is_3Di as 3Di input

    print(f"is_3Di is {is_3Di}. (0=expect input to be 3Di, 1= input is AA")

    if is_3Di: # if we go from 3Di (start/s) to AA (target/t)
        prefix_s2t = ""
        # don't generate 3Di or rare/ambig. AAs when outputting AA
        noGood = "acdefghiklmnpqrstvwyXBZ"

    if is_3Di:
        sequences[ uniprot_id ] += ''.join( line.split() ).replace("-","").lower()
    else:
        sequences[ uniprot_id ] += ''.join( line.split() ).replace("-","")



3. The documentation in the scripts README for translate calls translate_clean.py, which doesn't seem to exist

    python translate_clean.py --input /path/to/some_AA_sequences.fasta --output /path/to/output_directory --half 1 --is_3Di 0

There is a similar misnaming for predict_3Di_encoderOnly.py, which the README calls predict_3Di.py

    python predict_3Di.py --input /path/to/some_AA_sequences.fasta --output /path/to/some_3Di_sequences.fasta --half 1
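
For points 1 and 2, here is a minimal sketch of what consistent flag handling could look like. It is not taken from translate.py: the helper names `parse_flag`, `direction_message`, and `clean_sequence_line` are hypothetical, and the argument defaults are assumptions. It only mirrors the snippets quoted above, where a true is_3Di means 3Di input that gets lower-cased.

```python
import argparse


def parse_flag(value):
    """Interpret a 0/1 command-line value as a bool (0 -> False, otherwise True)."""
    return int(value) != 0


def direction_message(is_3Di):
    """Describe the translation direction so the message matches the branch taken."""
    if is_3Di:
        return "is_3Di is True: expecting 3Di input, translating 3Di -> AA"
    return "is_3Di is False: expecting AA input, translating AA -> 3Di"


def clean_sequence_line(line, is_3Di):
    """Strip whitespace and gap characters; lower-case only 3Di input."""
    seq = "".join(line.split()).replace("-", "")
    return seq.lower() if is_3Di else seq


# Defaults below are assumptions for this sketch, not the script's real defaults.
parser = argparse.ArgumentParser()
parser.add_argument("--half", type=int, default=0)
parser.add_argument("--is_3Di", type=int, default=0)
args = parser.parse_args(["--half", "1", "--is_3Di", "0"])

half_precision = parse_flag(args.half)
is_3Di = parse_flag(args.is_3Di)  # assigned exactly once

print(direction_message(is_3Di))
print(clean_sequence_line("MK-T AY\n", is_3Di))
```

Deriving the message from the same boolean that drives the branch would keep the two from drifting apart again.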

All the best,
Logan

mheinzinger commented 7 months ago

Amazing, thank you so much for the heads-up. I think I have addressed all the issues you raised above. In case you should spot anything else, please, just let me know and I'll try to fix asap :)