Closed schultzjack closed 1 year ago
Dear @schultzjack , thanks for your informative error report!
I just pushed a new commit - please try again now with the latest version and let me know if it works. If so, I'll push a new conda
build as well.
I believe the issue was incorrect logic in `create_embedding_bio_sp.py`, where labels were incorrectly passed.
Thank you for your prompt response, @tyronechen!
I appreciate your efforts in fixing the issue. I pulled the latest changes and re-ran the script. Unfortunately, I am still encountering the same error.
It seems like the error arises from the `embed_seqs_sp()` function in `utils.py`. The function appears to process columns of the CSV file as if they were strings. The `TypeError` is raised when the function tries to apply string operations to a float value, suggesting that somewhere in the function a value is not being correctly processed as a string.
Based on the analysis of the error traceback and the code, it seems that the embed_seqs_sp() function might be trying to process a column that doesn't exist in the input data or a column that contains non-string values.
Could it be possible that the function is expecting a different format or type of data than what is being provided in the train.csv file? Any insights you could provide would be greatly appreciated.
For reference, I am passing in -i (infile path), -t (tokeniser path), and -o (output_dir) when running the script.
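A minimal sketch of the suspected failure mode, assuming a cell in the CSV is missing and comes back from pandas as a float `NaN` (the column name `tokens` and the sample values are hypothetical, not taken from the actual train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical reproduction: a missing cell in the CSV is read back by
# pandas as a float NaN, and slicing a float raises the reported TypeError.
data = pd.DataFrame({"tokens": ["['AAA', 'AAC']", np.nan]})

def parse(x):
    # mirrors the kind of string handling done in embed_seqs_sp()
    return x[1:-1].replace("'", "").split()

try:
    data["tokens"].apply(parse)
    error = None
except TypeError as e:
    error = str(e)

print(error)  # 'float' object is not subscriptable
```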
Thanks for the info, @schultzjack.

Based on your feedback, there are a few possibilities I can think of, but I'll need to investigate these one by one:

1. A long entry is written out truncated as `[[1 ... 1]]`, where the `...` would be interpreted as str (this is a known bug and caused by the default numpy array display limits). You can inspect a long entry in your train.csv to rule this out.
2. An entry is written as `[]` instead of `[[]]`.
3. `datasets` encodes ClassLabels earlier; I cannot rule this out yet (but it's probably less likely than the other options).

Unfortunately I will only be able to look at this in detail probably next week, but if any of the above helps, or if you can think of other options please let me know (and feel free to submit a pull request if you choose to do so ٩(◕‿◕)۶)
Hello.
I am facing the same issue. I do not think it is truncation, as this is fixable with `np.set_printoptions(threshold=np.inf)`, which I have set at the top of utils.py; the resulting CSVs are no longer truncated.
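To illustrate the truncation point and the `np.set_printoptions` workaround mentioned above (the array contents here are arbitrary):

```python
import numpy as np

arr = np.ones(2000, dtype=int)

# By default numpy elides long arrays with "...", so writing str(arr)
# into a CSV stores a truncated string that cannot be parsed back.
truncated = str(arr)

# Raising the threshold makes numpy print every element, so nothing is
# lost on the way to the CSV.
np.set_printoptions(threshold=np.inf)
full = str(arr)

print("..." in truncated, "..." in full)  # True False
```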
Thanks for letting me know, @hannaglad. I'll focus my investigation on the 2nd and 3rd points 👍
Sorry all, I might have to postpone this until late August.
Hello,
I believe I may have figured out how to fix it, although I am unsure if it is correct. It seems to be working but I'm no expert, and I'm not exactly sure what I am looking for except no errors and an embeddings.csv.
In `create_embeddings_bio_sp.py`, I modified the following code before the first call to `Word2Vec` (in the `if model == None:` statement):
```python
# flatten the per-sequence kmer lists into one iterator
all_kmers = itertools.chain()
for i in kmers:
    all_kmers = itertools.chain(all_kmers, i)

# keep only the kmers, dropping the paired labels
real_kmers = itertools.chain()
for i, j in all_kmers:
    real_kmers = itertools.chain(real_kmers, [i])

model = Word2Vec(
    sentences=real_kmers,
    vector_size=w2v_vector_size,
    window=w2v_window,
    min_count=w2v_min_count,
    sg=w2v_sg,
    workers=njobs,
)
```
I think the error was coming from word2vec trying to create a vocabulary for the label (1 or 0)
I obtain a file (`kmers_embeddings.csv`) with one line per kmer, starting with the label, the string, and then 100 float values that I assume are the embeddings. It would be great if you could confirm that this is the expected behaviour.
Thank you for this really cool project :)
> I think the error was coming from word2vec trying to create a vocabulary for the label (1 or 0)
I took a closer look and it does seem to be the case; the aim of that step is to parse out the embeddings. It appears the tokens are not generated correctly, which is likely the cause of the float not subscriptable error everyone is getting (it is trying to index a `NaN`).
Interestingly, I was unable to replicate the fix by the method you specified, but I narrowed down the most likely low-level cause of the issue now.
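A quick way to sanity-check for that locally, assuming the tokens end up in a pandas column (the column name `tokens` and the sample values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical diagnostic: check whether the tokens column actually
# contains NaN rows before the string-processing step runs on it.
data = pd.DataFrame({"tokens": ["['AAA']", np.nan, "['TTT']"]})
bad_rows = data.index[data["tokens"].isna()].tolist()

print(bad_rows)  # [1]
```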
> I obtain a file (`kmers_embeddings.csv`) with one line per kmer starting with the label, the string, and then 100 float values that I assume are the embeddings. It would be great if you could confirm that this is the expected behaviour.
Yes, this is the intended behaviour:

```
label(int),kmer(str),embedding(float)
0,AAAAA,-0.122,-0.313,...
...
```
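For anyone who wants to consume that file programmatically, a minimal sketch of reading the layout described above (the sample string is made up, and real rows carry 100 floats rather than two):

```python
import csv
import io

# Hypothetical two-row sample in the label(int),kmer(str),floats layout.
sample = "0,AAAAA,-0.122,-0.313\n1,AAAAC,0.054,0.201\n"

rows = []
for rec in csv.reader(io.StringIO(sample)):
    # first field is the label, second the kmer, the rest the embedding
    label, kmer = int(rec[0]), rec[1]
    embedding = [float(v) for v in rec[2:]]
    rows.append((label, kmer, embedding))

print(rows[0])  # (0, 'AAAAA', [-0.122, -0.313])
```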
> Thank you for this really cool project :)
Hope you find it useful. If you see any other bugs, please do not hesitate to let us know.
> Interestingly, I was unable to replicate the fix by the method you specified, but I narrowed down the most likely low-level cause of the issue now.
Nevermind, this was one component of the fix; a second component was that an intermediate file was being parsed incorrectly (a column frameshift caused by an unintended null column).
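As a hypothetical illustration of that kind of frameshift (not the project's actual intermediate file): a stray trailing comma on every row produces an unnamed, all-null column, and dropping all-null columns restores the intended layout:

```python
import io
import pandas as pd

# Made-up file: every row ends with a stray comma, so pandas sees an
# extra unnamed column filled entirely with NaN.
bad = io.StringIO("label,kmers,\n0,AAAAA,\n1,AAAAC,\n")
df = pd.read_csv(bad)
print(list(df.columns))  # ['label', 'kmers', 'Unnamed: 2']

# Dropping all-null columns recovers the intended two-column layout.
df = df.dropna(axis=1, how="all")
print(list(df.columns))  # ['label', 'kmers']
```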
This should work as expected now; I ran a few small tests and saw no issues so far. v2.4.4 is building and will be online shortly.
Please install with:

```
mamba install -c tyronechen -c conda-forge genomenlp==2.4.4
```
In case there are still any bugs related to this, I will leave this issue open until the end of the month and close it if there are no further bug reports.
> I think the error was coming from word2vec trying to create a vocabulary for the label (1 or 0)
>
> I took a closer look and it does seem to be the case; the aim of that step is to parse out the embeddings. It appears the tokens are not generated correctly, which is likely the cause of the float not subscriptable error everyone is getting (it is trying to index a `NaN`).
>
> Interestingly, I was unable to replicate the fix by the method you specified, but I narrowed down the most likely low-level cause of the issue now.
Thank you for posting your thoughts regarding this issue; what you're saying aligns with what I was seeing with how it was trying to index a `NaN`. I thought it was more related to how pandas handles missing values and empty strings, and how, when reading a CSV file, pandas treats both as `NaN`. Given what you said about the tokens not getting generated correctly, it seems like the issue is less about simply preserving empty strings when loading DataFrames, so strategies like using the `fillna` function to convert `NaN` values to empty strings won't actually resolve the core issue (even if they eliminate that particular error).
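A small sketch of why `fillna` only masks the symptom, assuming a hypothetical `tokens` column with one missing cell:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one well-formed row and one missing cell.
data = pd.DataFrame({"tokens": ["['AAA', 'AAC']", np.nan]})

# fillna("") prevents the TypeError, but the row that should have had
# tokens now just yields an empty list, so the underlying
# token-generation bug is hidden rather than fixed.
filled = data["tokens"].fillna("")
parsed = filled.apply(lambda x: x[1:-1].replace("'", "").split())

print(parsed.tolist())  # [['AAA,', 'AAC'], []]
```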
I'm not doing further testing on this issue at the moment as I'm focusing on running the tokenise_bio.py script with the human genome fasta featured in your preprint. I'll use a separate issue for that.
Thank you for creating this very interesting project and for your help with investigating these issues!
No problem @schultzjack , in that case I will close it for now, but if this resurfaces let me know and I will reopen to investigate.
Describe the bug: When I attempt to run the `create_embedding_bio_sp.py` script, it fails with a `TypeError: 'float' object is not subscriptable`. The error seems to be due to the script's inability to handle float values in the data.

To Reproduce: Here are the exact commands that I ran:

1. `tokenise_bio.py` script:
2. `create_dataset_bio.py` script:
3. `create_embedding_bio_sp.py` script:

Expected behavior: Expected the `create_embedding_bio_sp.py` script to process the dataset and generate the embeddings without errors.

Error message: `TypeError: 'float' object is not subscriptable`

Suggested fix: Unclear. Issue appears to be in `utils.py` at line 895, in `<lambda>`:

```python
sp = data[column].apply(lambda x: x[1:-1].replace("\'", "").split())
```
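For illustration only, a defensive variant of that line which skips non-string cells instead of crashing (a hypothetical workaround, not the fix that shipped in v2.4.4; the column name `tokens` and sample values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the failure: one parseable row, one NaN.
data = pd.DataFrame({"tokens": ["['AAA', 'AAC']", np.nan]})

# Guarding with isinstance(x, str) avoids the TypeError and makes the
# malformed rows visible as empty lists instead of a crash.
sp = data["tokens"].apply(
    lambda x: x[1:-1].replace("'", "").split() if isinstance(x, str) else []
)

print(sp.tolist())  # [['AAA,', 'AAC'], []]
```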