tyronechen / genomenlp

https://genomenlp.readthedocs.io/en/latest/
MIT License

Script `create_embedding_bio_sp.py` fails with 'float' object is not subscriptable error #3

Closed · schultzjack closed this issue 1 year ago

schultzjack commented 1 year ago

Describe the bug: When I attempt to run the create_embedding_bio_sp.py script, it fails with a TypeError: 'float' object is not subscriptable. The error seems to be due to the script's inability to handle 'float' values in the data.

To Reproduce: Here are the exact commands that I ran:

  1. Set up the Python environment and install dependencies.
  2. Run the tokenise_bio.py script:
    source activate myenv && python "tokenise_bio.py" -i "<your_path>/storage/FASTA_small/smallfasta1.fasta" -t "<your_path>/storage/tokenizer-001.json"
  3. Run the create_dataset_bio.py script:
    source activate myenv && python "create_dataset_bio.py" "<your_path>/storage/FASTA_small/smallfasta1.fasta" "<your_path>/storage/FASTA_small/smallfasta2.fasta" "<your_path>/storage/tokenizer-001.json" -c 200 -o "<your_path>/storage/dataset-bio-003"
  4. Run the create_embedding_bio_sp.py script:
    source activate myenv && python "create_embedding_bio_sp.py" -i "<your_path>/storage/dataset-bio-003/train.csv" -t "<your_path>/storage/tokenizer-001.json" -o "<your_path>/storage/embeddings-bio-002"

Expected behavior: Expected create_embedding_bio_sp.py script to process the dataset and generate the embeddings without errors.

Error message: TypeError: 'float' object is not subscriptable

Suggested fix: Unclear. The issue appears to be in utils.py at line 895, in <lambda>:

    sp = data[column].apply(lambda x: x[1:-1].replace("\'", "").split())
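
For what it's worth, here is a minimal, hypothetical reproduction of that traceback (the column name and values are made up, not taken from the real dataset). pandas reads an empty CSV field back as NaN, which is a float, so the string slicing in the lambda fails on that row:

    import io
    import pandas as pd

    # Hypothetical repro: pandas reads the empty field in row 2 as NaN (a float),
    # so x[1:-1] raises TypeError: 'float' object is not subscriptable.
    csv = "label,input_str\n0,\"['AAAA', 'AAAT']\"\n1,\n"
    data = pd.read_csv(io.StringIO(csv))
    data["input_str"].apply(lambda x: x[1:-1].replace("'", "").split())  # raises on row 2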

tyronechen commented 1 year ago

Dear @schultzjack, thanks for your informative error report!

I just pushed a new commit - please try again now with the latest version and let me know if it works. If so, I'll push a new conda build as well.

I believe the issue was incorrect logic in create_embedding_bio_sp.py, where labels were passed incorrectly.

schultzjack commented 1 year ago

Thank you for your prompt response, @tyronechen!

I appreciate your efforts in fixing the issue. I pulled the latest changes and re-ran the script. Unfortunately, I am still encountering the same error.

It seems like the error arises from the embed_seqs_sp() function in utils.py. The function processes columns of the CSV file as if they were strings, and the TypeError is raised when it applies string operations to a float value, suggesting that somewhere in the function a value that should be a string is arriving as a float.

Based on the analysis of the error traceback and the code, it seems that the embed_seqs_sp() function might be trying to process a column that doesn't exist in the input data or a column that contains non-string values.

Could it be possible that the function is expecting a different format or type of data than what is being provided in the train.csv file? Any insights you could provide would be greatly appreciated.
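
In the meantime, one possible workaround occurred to me. This is only a sketch against the names used in utils.py (data and column), and as discussed further down the thread it may mask the real problem rather than fix it:

    # Sketch of a defensive workaround, not a root-cause fix: coerce the column
    # to strings so the slicing never hits a float NaN; empty cells yield [].
    data[column] = data[column].fillna("").astype(str)
    sp = data[column].apply(lambda x: x[1:-1].replace("'", "").split() if x else [])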

For reference, I am passing in -i (infile path), -t (tokeniser path), and -o (output_dir) when running the script.

tyronechen commented 1 year ago

Thanks for the info @schultzjack,

Based on your feedback, there are a few possibilities I can think of, but I'll need to investigate these one by one:

Unfortunately I will only be able to look at this in detail probably next week, but if any of the above helps, or if you can think of other options please let me know (and feel free to submit a pull request if you choose to do so ٩(◕‿◕)۶)

hannaglad commented 1 year ago

Hello.

I am facing the same issue. I do not think it is truncation, as that is fixable with np.set_printoptions(threshold=np.inf), which I have set at the top of utils.py; the resulting CSVs are no longer truncated.
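
For reference, this is the line I added (assuming numpy is already imported in utils.py):

    import numpy as np

    # stop numpy abbreviating long arrays with "..." when they are printed,
    # so the full embeddings end up in the CSVs
    np.set_printoptions(threshold=np.inf)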

tyronechen commented 1 year ago

Thanks for letting me know @hannaglad , I'll focus my investigation on the 2nd and 3rd points 👍

tyronechen commented 1 year ago

> Unfortunately I will only be able to look at this in detail probably next week, but if any of the above helps, or if you can think of other options please let me know (and feel free to submit a pull request if you choose to do so ٩(◕‿◕)۶)

Sorry all, i might have to postpone this until late August.

hannaglad commented 1 year ago

Hello,

I believe I may have figured out how to fix it, although I am unsure if it is correct. It seems to be working, but I'm no expert, and I'm not exactly sure what to look for beyond the absence of errors and an embeddings.csv.

In create_embedding_bio_sp.py, I modified the following code before the first call to Word2Vec (inside the if model == None: block):

    # flatten the per-sequence iterables into a single stream of items
    all_kmers = itertools.chain()
    for i in kmers:
        all_kmers = itertools.chain(all_kmers, i)

    # each item is a pair; keep only the kmers (i) and drop the label (j),
    # so word2vec never sees the 0/1 labels
    real_kmers = itertools.chain()
    for i, j in all_kmers:
        real_kmers = itertools.chain(real_kmers, [i])

    model = Word2Vec(
        sentences=real_kmers,
        vector_size=w2v_vector_size,
        window=w2v_window,
        min_count=w2v_min_count,
        sg=w2v_sg,
        workers=njobs,
        )

I think the error was coming from word2vec trying to create a vocabulary for the label (1 or 0)
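
To illustrate what I mean with a toy example (these lists are made up, not the project's actual data structures):

    from gensim.models import Word2Vec

    # toy (kmers, label) pairs: passing the pairs straight to Word2Vec would
    # treat the integer label as corpus material, so strip the labels first
    pairs = [(["AAAA", "AAAT"], 0), (["CCCC", "CCCG"], 1)]
    sentences = [kmers for kmers, label in pairs]
    model = Word2Vec(sentences=sentences, vector_size=8, min_count=1)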

I obtain a file (kmers_embeddings.csv) with one line per kmer starting with the label, the string, and then 100 float values that I assume are the embeddings. It would be great if you could confirm that this is the expected behaviour.

Thank you for this really cool project :)

tyronechen commented 1 year ago

> I think the error was coming from word2vec trying to create a vocabulary for the label (1 or 0)

I took a closer look and it does seem to be the case; the aim of that step is to parse out the embeddings. It appears the tokens are not generated correctly, which is likely the cause of the 'float' object is not subscriptable error everyone is getting (it is trying to index a NaN).

Interestingly, I was unable to replicate the fix by the method you specified, but I have now narrowed down the most likely low-level cause of the issue.

> I obtain a file (kmers_embeddings.csv) with one line per kmer starting with the label, the string, and then 100 float values that I assume are the embeddings. It would be great if you could confirm that this is the expected behaviour.

Yes, this is the intended behaviour: label(int),kmer(str),embedding(float)

0,AAAAA,-0.122,-0.313,...
...
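
A sketch of reading that file back (assuming it has no header row, as in the example above):

    import pandas as pd

    emb = pd.read_csv("kmers_embeddings.csv", header=None)
    labels = emb[0]                        # int labels
    kmers = emb[1]                         # kmer strings
    vectors = emb.iloc[:, 2:].to_numpy()   # the embedding values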

> Thank you for this really cool project :)

Hope you find it useful; if you see any other bugs, please do not hesitate to let us know.

tyronechen commented 1 year ago

> Interestingly, I was unable to replicate the fix by the method you specified, but I have now narrowed down the most likely low-level cause of the issue.

Never mind: this was only one component of the fix. A second component was that an intermediate file was being parsed incorrectly (a column frameshift caused by an unintended null column).
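
To make the frameshift concrete with a toy example (the column names are made up): a stray trailing delimiter adds an unnamed, all-null column, so anything reading fields by position picks up the wrong values:

    import io
    import pandas as pd

    # toy column frameshift: the trailing comma creates an unnamed null column
    csv = "label,kmers,\n0,\"['AAAA']\",\n"
    df = pd.read_csv(io.StringIO(csv))
    print(df.columns.tolist())  # ['label', 'kmers', 'Unnamed: 2']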

This should work as expected now; I ran a few small tests and hit no issues so far. v2.4.4 is building and will be online shortly.

Please install with:

mamba install -c tyronechen -c conda-forge genomenlp==2.4.4

In case there are still any bugs related to this, I will leave this issue open until the end of the month and close it if there are no further bug reports.

schultzjack commented 1 year ago

> I think the error was coming from word2vec trying to create a vocabulary for the label (1 or 0)

> I took a closer look and it does seem to be the case; the aim of that step is to parse out the embeddings. It appears the tokens are not generated correctly, which is likely the cause of the 'float' object is not subscriptable error everyone is getting (it is trying to index a NaN).

> Interestingly, I was unable to replicate the fix by the method you specified, but I have now narrowed down the most likely low-level cause of the issue.

Thank you for posting your thoughts on this issue; what you're saying aligns with what I was seeing, namely that it was trying to index a NaN. I initially thought it was more about how pandas handles missing values and empty strings, since pandas treats both as NaN when reading a CSV file. But given what you said about the tokens not being generated correctly, the issue is less about preserving empty strings when loading DataFrames, so strategies like using fillna to convert NaN values to empty strings won't resolve the core issue (even if they eliminate that particular error).
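
To illustrate the pandas behaviour I mean (a toy example):

    import io
    import pandas as pd

    # pandas reads an empty field as NaN (a float) by default;
    # keep_default_na=False preserves it as an empty string, which avoids the
    # TypeError but still leaves the row without any tokens
    csv = "label,tokens\n0,\n"
    print(pd.read_csv(io.StringIO(csv))["tokens"].iloc[0])                         # nan
    print(pd.read_csv(io.StringIO(csv), keep_default_na=False)["tokens"].iloc[0])  # ''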

I'm not doing further testing on this issue at the moment, as I'm focusing on running the tokenise_bio.py script with the human genome FASTA featured in your preprint. I'll open a separate issue for that.

Thank you for creating this very interesting project and for your help with investigating these issues!

tyronechen commented 1 year ago

No problem @schultzjack , in that case I will close it for now, but if this resurfaces let me know and I will reopen to investigate.