hamzagamouh / protein_embeddings


adding /app/output/ to the path to input_file #3

Open ProkopDivin opened 1 year ago

ProkopDivin commented 1 year ago

when running this:

divinpr@volta05:/$ python compute_protein_embeddings.py --emb_name bert --input_dataset a.001.001.001_1s69a_A.fa  --output_folder output.txt
Import embedder...
Some weights of the model checkpoint at /home/divinpr/.cache/bio_embeddings/prottrans_bert_bfd/model_directory were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Getting sequences from dataset ...
Traceback (most recent call last):
  File "compute_protein_embeddings.py", line 84, in <module>
    for n,file in enumerate(os.listdir(input_dataset)):
FileNotFoundError: [Errno 2] No such file or directory: '/app/output/a.001.001.001_1s69a_A.fa'

The problem is that the argument input_dataset has the value a.001.001.001_1s69a_A.fa, but for some reason the program prepends /app/output/ to it, so the path to the input file that is passed to the script as a parameter gets changed. I'm sure this is a mistake. It is probably in this line: input_dataset="/app/output/"+args.input_dataset. I'm not sure whether this was meant to be the output path or whether there is some other intention.
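
One way to avoid the hard-coded prefix could look like the sketch below. It is not the repository's code: resolve_input_path and its base_dir default are hypothetical names introduced for illustration, and it assumes the prefix was meant to locate data under the bind-mounted folder when the given path cannot be found as-is.

```python
import os


def resolve_input_path(user_path, base_dir="/app/output"):
    """Return a usable path for --input_dataset (hypothetical helper).

    Absolute paths, and paths that already exist relative to the current
    directory, are returned unchanged. Anything else is looked up under
    base_dir, which is assumed to be the bind-mounted folder.
    """
    if os.path.isabs(user_path) or os.path.exists(user_path):
        return user_path
    return os.path.join(base_dir, user_path)
```

With this, a path that exists in the current directory is used directly, and only an unresolvable relative path falls back to base_dir.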

It also looks like there will be more bugs.
Considering the previous mistake, it looks like this code could never have run properly. Did you perhaps upload the wrong version or something like that? Can you try it yourself and debug it?

hamzagamouh commented 1 year ago

@ProkopDivin Thank you for the feedback. I expect some debugging will be needed, since the repo was created recently and I have only tested the code on my local server. The code was the basis of my master's thesis and it should work; only some minor changes need to be dealt with, due to a lack of attention on my part. I apologize for any inconvenience.

This error is probably related to mounting your local volume into the image. When you run the image as a container, you should specify which folder inside the image will be connected to your local storage. I forgot to add an argument to the interactive execution of the image:

ch-run --bind /home/username/files:/app/output biopython bash

In the bash .sh scripts the problem shouldn't be there (you only need to replace the source path). Please let me know if this resolves the issue.

hamzagamouh commented 1 year ago

I will add more information about how to work with the --bind argument.

hamzagamouh commented 1 year ago

I have made some changes to the instructions for interactive mode. Please have a look at them again.

ProkopDivin commented 1 year ago

Unfortunately, it doesn't.

[divinpr@volta05 protein_embeddings]$ ls
a.001.001.001_1s69a_A.fa   compute_embeddings_gpu.sh      README.md
biopython                  compute_protein_embeddings.py  requirements.txt
compute_embeddings_cpu.sh  Dockerfile
[divinpr@volta05 protein_embeddings]$ ch-run --bind /home/divinpr/pbsprediction/protein_embeddings:/app/output biopython bash
ch-run[706080]: error: can't mkdir: /home/divinpr/pbsprediction/protein_embeddings/biopython/app/output: Read-only file system (ch_misc.c:409 30)
[divinpr@volta05 protein_embeddings]$ ls biopython/
app  boot  dev  home  lib64  mnt  proc  run   srv  tmp  var
bin  ch    etc  lib   media  opt  root  sbin  sys  usr
[divinpr@volta05 protein_embeddings]$ ls biopython/app/
a.001.001.001_1s69a_A.fa   compute_protein_embeddings.py  requirements.txt
compute_embeddings_cpu.sh  Dockerfile
compute_embeddings_gpu.sh  README.md

The output directory just cannot be created.

hamzagamouh commented 1 year ago

@ProkopDivin Can you create the directory manually? mkdir biopython/app/output
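
The same check can be done before launching the container. This is only a sketch based on the error shown above, where ch-run could not create the bind destination inside the read-only image: bind_target_exists is a hypothetical helper, and it assumes the image is an unpacked directory such as biopython.

```python
import os


def bind_target_exists(image_dir, target):
    """Check that the bind destination already exists inside the unpacked image.

    image_dir is the unpacked image directory (e.g. "biopython") and target
    is the in-container path (e.g. "/app/output"). Hypothetical helper.
    """
    return os.path.isdir(os.path.join(image_dir, target.lstrip("/")))
```

For example, bind_target_exists("biopython", "/app/output") tells you whether mkdir biopython/app/output is still needed.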

ProkopDivin commented 1 year ago

Yes, but then the Python script cannot create the output files.

hamzagamouh commented 1 year ago

You should see the outputs here /home/divinpr/pbsprediction/protein_embeddings (source directory)

ProkopDivin commented 1 year ago

Well, the script ends with an error, so no files are created.

[divinpr@volta05 protein_embeddings]$ mkdir biopython/app/output
[divinpr@volta05 protein_embeddings]$ ch-run --bind /home/divinpr/pbsprediction/protein_embeddings:/app/output
divinpr@volta05:/$ ls
app  bin  boot  ch  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
divinpr@volta05:/$ python ./app/output/compute_protein_embeddings.py --emb_name bert --input_dataset a.001.001.
Import embedder...
Some weights of the model checkpoint at /home/divinpr/.cache/bio_embeddings/prottrans_bert_bfd/model_directory were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Getting sequences from dataset ...
1 sequences found
Getting embeddings from bert
Traceback (most recent call last):
  File "./app/output/compute_protein_embeddings.py", line 121, in <module>
    with zipfile.ZipFile(f"{output_folder}/{dataset}_{emb_name}.zip","w") as thezip:
  File "/usr/local/lib/python3.7/zipfile.py", line 1240, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'embeddings/a.001.001.001_1s69a_A_bert.zip'

hamzagamouh commented 1 year ago

You should specify a valid output folder as an argument. python ./app/output/compute_protein_embeddings.py --emb_name bert --input_dataset ... --output_folder ~/pbsprediction/protein_embeddings

hamzagamouh commented 1 year ago

Or you can create a folder called embeddings inside your source directory, and you should expect output data there. Here you should run python ./app/output/compute_protein_embeddings.py --emb_name bert --input_dataset ... --output_folder embeddings
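
The script could also create the folder itself before writing the archive. The sketch below is not the current code in the repository: open_output_zip is a hypothetical helper, and it only reuses the {output_folder}/{dataset}_{emb_name}.zip naming visible in the traceback above.

```python
import os
import zipfile


def open_output_zip(output_folder, dataset, emb_name):
    """Create output_folder if it is missing, then open the zip for writing.

    Hypothetical helper; the file name follows the pattern seen in the
    traceback: {output_folder}/{dataset}_{emb_name}.zip
    """
    os.makedirs(output_folder, exist_ok=True)  # no error if it already exists
    zip_path = os.path.join(output_folder, f"{dataset}_{emb_name}.zip")
    return zipfile.ZipFile(zip_path, "w")
```

It can be used as a drop-in for the with statement, e.g. with open_output_zip(output_folder, dataset, emb_name) as thezip: ..., so a missing embeddings folder no longer raises FileNotFoundError.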

ProkopDivin commented 1 year ago

This worked, thank you.

hamzagamouh commented 1 year ago

You're welcome.