vered1986 / HypeNET

Integrated path-based and distributional method for hypernymy detection

KeyError: '\xf0\x93\x86\x8e\xf0\x93\x85\x93\xf0\x93\x8f\x8f\xf0\x93\x8a\x96' #7

Open zhixiaochuan12 opened 5 years ago

zhixiaochuan12 commented 5 years ago

Hello, I need to reproduce the results on a subset of your dataset, and I ran into some problems: the parsing process getting killed, an ASCII error in create_*_1.py, and a KeyError in create_*_2.py. Some of them are the same as those @wt123u reported in another issue.

I deleted the & before line 40 in create_*.sh to solve the killed-process problem (so the parsing jobs run in the foreground rather than in the background).

I added sys.setdefaultencoding('utf-8') to solve the ASCII error.
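For completeness, in Python 2 that function is removed by site.py at startup, so the snippet I added has to reload sys first; roughly:

import sys
reload(sys)  # re-expose sys.setdefaultencoding, which site.py deletes at startup
sys.setdefaultencoding('utf-8')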

Then I met the KeyError in create_*_2.py. I tried to solve it by moving x_id, y_id, path_id = term_to_id_db[x], term_to_id_db[y], path_to_id_db.get(path, -1) into the try block, and in the end I got a db file of nearly 70GB. When I train the model, it shows Pairs without paths: 1549 , all dataset: 20314. Training with these path-less pairs can damage the results, so the comparison would be unfair.
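Roughly, my change looked like this (a sketch: the surrounding loop and the triplets name are my paraphrase of create_resource_from_corpus_2.py, not the original code):

for x, y, path in triplets:  # iterate over the extracted (x, y, path) triplets
    try:
        x_id, y_id = term_to_id_db[x], term_to_id_db[y]
        path_id = path_to_id_db.get(path, -1)
    except KeyError:
        continue  # skip pairs whose terms are missing from the term-to-id db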

I am using the 20181201 version of the wiki dump and spaCy 1.9.0; could the version differences or the changes above be the cause of the KeyError? What can I do to get fair results? Thanks!

vered1986 commented 5 years ago

Hi, sorry for the slow response.

I'm not sure what the issue is, but here is the output I get when I run the training on my end:

Loading the dataset...
Done!
Initializing word embeddings...
Loading path files...
Loading the corpus...
Done!
Pairs without paths: 0 , all dataset: 70679
Done!
Number of lemmas 400001, number of pos tags: 16, number of dependency labels: 46, number of directions: 6
Creating the network...
Done!
Training with learning rate = 0.001000, dropout = 0.300000...
Training the model...
Epoch 1 / 3 Loss = 0.115487417519
Epoch 2 / 3 Loss = 0.0726631993667
Epoch 3 / 3 Loss = 0.0535255032689
Done!
Evaluation:
Precision: 0.933, Recall: 0.894, F1: 0.913

(This output is from running V2; V1 gives slightly lower numbers.)

So this already tells us there is something wrong with the preprocessing of the corpus. Specifically, I would suspect the ASCII error, which I have never gotten in this project before. Where exactly does it happen? What were the commands you ran (including parameters)?

zhixiaochuan12 commented 5 years ago

Thanks for your reply!

As for the ASCII error, it happened when I ran ./create_*.sh; here is the log:

Creating the resource from the triplets file...
Saving the paths...
Traceback (most recent call last):
  File "create_resource_from_corpus_1.py", line 73, in <module>
    main()
  File "create_resource_from_corpus_1.py", line 44, in main
    id, path = str(id), str(path)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 15: ordinal not in range(128)

I am using subsets of your random and lexical datasets; the numbers I reported above are from the subset of the lexical dataset.

From your output, I see that the info shown during training is different, which means the spaCy version does make some changes:

Number of lemmas 400001, number of pos tags: 17, number of dependency labels: 60, number of directions: 6

I also found that the KeyError lines contain some strange characters (see the attached screenshot). I guess this is a result of my setting the default encoding, so the strange characters get encoded differently? But it should not influence the coverage of the dataset, because there are no strange characters in the dataset itself.
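As a side note, the byte string in my issue title decodes as valid UTF-8 into Egyptian hieroglyph codepoints, so the failing key looks like a rare Wikipedia term rather than random corruption (a quick check in a Python 2 shell):

>>> '\xf0\x93\x86\x8e\xf0\x93\x85\x93\xf0\x93\x8f\x8f\xf0\x93\x8a\x96'.decode('utf-8')
u'\U0001318e\U00013153\U000133cf\U00013296'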

I will change the spaCy version first and see what happens.

vered1986 commented 5 years ago

OK, I see now that the problem is that one of the paths contains a bad character. I'm not sure why this is happening, but I can think of a workaround / a way to figure out what's going on: use this path file instead in create_resource_from_corpus_1.py. These are the common paths from the follow-up project, LexNET.

The terms in your example do seem weird; I don't know if there are such terms in the dataset. I think it may be caused by your default-encoding change.

Edit 03/04: This workaround will actually not work because the path format changed. You can download the HypeNET frequent paths from here.

JohnDzeng commented 3 years ago

Hi, the HypeNET frequent paths zip file seems to be broken. Can you check that out? Thank you very much!

vered1986 commented 3 years ago

Hi @JohnDzeng, the link works for me. I'm not sure why it doesn't work for you.

JohnDzeng commented 3 years ago

@vered1986 Thanks for your reply! I'm facing the unzip error “End-of-central-directory signature not found”. Is it possible that the zip file is corrupt?

vered1986 commented 3 years ago

You're right, sorry about that; the upload to Google Drive must have stopped before the entire file was transferred. Since I no longer have access to the original file (it was on the server at my previous lab), I can suggest either recomputing the frequent paths with create_resource_from_corpus.sh while catching the UnicodeEncodeError, or switching to LexNET, for which all the files should still be available.
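For the first option, catching the error around the line that failed in the traceback earlier in this thread would look roughly like this (a sketch using the variable names from that traceback; skipping the offending paths is my suggestion, not tested code):

try:
    id, path = str(id), str(path)
except UnicodeEncodeError:
    continue  # drop the rare paths containing non-ASCII characters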

JohnDzeng commented 3 years ago

OK, thank you very much!