jackievaleri / BioAutoMATED

Automated machine learning for analyzing, interpreting, and designing biological sequences
MIT License

Missing Files When Pulling Docker Repository #7

Open Sebatina opened 4 months ago

Sebatina commented 4 months ago

Hello. As a newcomer to Docker, I attempted to pull the Docker repository for BioAutoMATED. However, I encountered an issue where certain files were missing from the pulled repository. This is a challenge for me because I am still learning to navigate Docker environments and rely on having all necessary files available.

Steps to Reproduce:

  1. Executed the command docker pull jackievaleri/bioautomated:v5 to pull the Docker repository.
  2. Observed that certain files or directories are missing from the pulled repository.

I expected all necessary files and directories to be present in the pulled Docker repository. However, certain files or directories are missing, making it impossible to proceed with the installation and usage of BioAutoMATED.

Details of Missing Items:

  1. Missing Directory: 'benchmarking'
  2. Details of clean_data Directory:
    • Only the 'clean' folder is present.
    • Expected files such as 'hollerer_rbs_train.csv', 'peptides.csv', 'small_synthetic.csv', and 'toeholds.csv' are missing.
  3. Incomplete Directory: 'exemplars'
    • Only the 'small_synthetic_nucleic_acids' folder is present.

These are some of the files that I mentioned as missing.

Environment

Additional Information

As a beginner in Docker, I may not be aware of potential troubleshooting steps or alternative solutions to address this issue. I have attempted to pull the Docker repository multiple times, but the issue persists. Additionally, I have checked the repository on GitHub to verify that the missing files are indeed absent from the source.

Request for Assistance

Given my limited experience with Docker, I kindly request assistance in resolving this issue and obtaining the missing files. Any guidance or suggestions tailored to a newcomer's perspective would be greatly appreciated.

Thank you for your understanding and support.

jackievaleri commented 4 months ago

Hi Sebatina,

Thank you so much for your very detailed report. What you are describing is the expected behavior. All the data to reproduce the manuscript figures is provided in the GitHub repository, but a more minimal set of files is provided in the Docker repository due to the already-substantial time needed to pull the Docker images. When we had more data in the Docker image, it took more time to download the information from Docker onto one's local machine.

However, if you would like to add data from the GitHub repository into the Docker repository, you can do this. You can clone the GitHub repository onto your local machine with the command git clone https://github.com/jackievaleri/BioAutoMATED.git BioAutoMATED. Then, you can upload files from that repository into the Jupyter interface as you normally would.
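After cloning, it can help to confirm where the items named in this thread sit inside the clone before uploading them through Jupyter. The following is a minimal Python sketch and is not part of BioAutoMATED. The function name locate is hypothetical, the folder name BioAutoMATED is taken from the git clone command above, and the search is by name only because the repository's directory layout is not described in this thread.

```python
from pathlib import Path

# Names reported in this thread as absent from the Docker image.
# Where they live inside the GitHub repository is not assumed here;
# the search below looks for them anywhere under the clone.
NAMES = [
    "benchmarking",
    "exemplars",
    "hollerer_rbs_train.csv",
    "peptides.csv",
    "small_synthetic.csv",
    "toeholds.csv",
]


def locate(repo_root, names=NAMES):
    """Map each name to the sorted relative paths where it was found."""
    root = Path(repo_root)
    if not root.is_dir():
        # Nothing cloned at this location: report every name as not found.
        return {name: [] for name in names}
    return {
        name: sorted(str(p.relative_to(root)) for p in root.rglob(name))
        for name in names
    }


if __name__ == "__main__":
    # "BioAutoMATED" is the target folder used in the git clone command above.
    for name, hits in locate("BioAutoMATED").items():
        print(name, "->", ", ".join(hits) if hits else "not found")
```

Any path it reports can then be uploaded through the Jupyter file browser as described above.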

Please let me know if this helps address your question. I'm happy to provide additional support if needed.

Sebatina commented 3 months ago

Hi Jackievaleri,

Got it, thanks for the clarification!

I'll go ahead and clone the GitHub repository to access the additional files needed. Appreciate your guidance on this.

jackievaleri commented 3 months ago

Great! I'm going to close this issue but please feel free to reach out if other questions come up.

Sebatina commented 3 months ago

Hi, I have a query regarding the feature extraction functionality in BioAutoMATED. My dataset comprises approximately 50,000 sequences stored in a single column. Unlike typical datasets, these sequences do not have any associated features.

Could you kindly advise on the appropriate approach to utilize BioAutoMATED for extracting features from these sequences as part of an AutoML pipeline? Any guidance or recommendations you could provide would be immensely helpful.

jackievaleri commented 3 months ago

Thank you for reaching out about the feature extraction functionality in BioAutoMATED. Based on your description, BioAutoMATED may not be the ideal tool for your specific use case. BioAutoMATED is designed to map sequences to a single binary value, continuous value, or categorical value, which may not align with your need to extract features from 50,000 sequences with no associated features.

Our tool is optimized for mapping a single sequence to a single value, such as in cases where you have a specific sequence of interest (e.g., a protein sequence) and a corresponding value (e.g., immunogenicity of that protein). In this scenario, you would provide a CSV or Excel file with a column for sequences and a column for values, allowing BioAutoMATED to create a model based on the sequence-function relationship.
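To make the expected input shape concrete, here is a minimal Python sketch that writes a two-column CSV file with one sequence and one value per row. It is only an illustration: the function name write_sequence_table, the column names sequence and value, the file name toy_input.csv, and the toy rows are all hypothetical placeholders, not names or data from BioAutoMATED.

```python
import csv


def write_sequence_table(path, rows, seq_col="sequence", value_col="value"):
    """Write (sequence, value) pairs to a CSV with one header row.

    The column names are placeholders chosen for this sketch.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([seq_col, value_col])
        for seq, value in rows:
            writer.writerow([seq, value])


if __name__ == "__main__":
    # Toy placeholder rows, not real measurements.
    toy_rows = [("ACGTACGT", 0.5), ("TTGACCAA", 1.25)]
    write_sequence_table("toy_input.csv", toy_rows)
```

A dataset with only a sequence column, as in the question above, has nothing to put in the second column, which is why it does not fit this sequence-to-value setup.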

However, for datasets with multiple sequences and no associated features, other tools may be more appropriate for feature extraction and AutoML pipelines. We recommend exploring tools specifically designed for handling large datasets of sequences without associated features. In particular, you may want to explore iLearnPlus, which has a robust set of feature extraction options for nucleic acid and protein sequences: https://ilearnplus.erc.monash.edu