shaoyangxu opened this issue 3 years ago
Hello, I have two questions:
1) What do you mean by the code below?
bound1 = 14900; bound2 = 2135; bound3 = 50; bound4 = 100; bound5 = 117; bound6 = 227;
2) How can I get the files below?
Actually, I can guess that:
50 is for wiki-50
100 is for CITIES
117 is for ELEMENTS
227 is for CLINICAL
However, the two numbers 14900 and 2135 are confusing to me. In particular, did you do some preprocessing on the corpora with 14900 and 2135 documents respectively? I ask because I see 'cleaned' in the paths /ubc/cs/research/nlp/Linzi/seg/bert/bert_emb_train_cleaned.txt
and /ubc/cs/research/nlp/Linzi/seg/bert/bert_emb_dev_cleaned.txt
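My guess is that these bounds are per-corpus document counts and the corpora are stored back to back, so the code slices one big list by cumulative offsets, something like the sketch below (just my guess; the names and the splitting function are mine, not from the repo):

```python
# My guess only: bounds are per-corpus document counts, corpora concatenated in this order.
bounds = [14900, 2135, 50, 100, 117, 227]
names = ["train", "dev", "wiki-50", "CITIES", "ELEMENTS", "CLINICAL"]

def split_by_bounds(all_docs, bounds, names):
    """Slice one concatenated document list into per-corpus chunks."""
    chunks, start = {}, 0
    for name, count in zip(names, bounds):
        chunks[name] = all_docs[start:start + count]
        start += count
    return chunks
```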
I have downloaded WikiSection and found that there are 16,192 English training examples (2,513 for disease + 13,679 for city), which does not match 14,900.
Hi beiweixiaoxu,
Thanks for your interest!
About the size of the WikiSection data: we found that after a few preprocessing steps (e.g., removing special tokens and words missing from word2vec), some documents ended up with no content, so we removed those documents.
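Roughly, the cleaning step looks like the sketch below (not the exact script; the special-token list and the word2vec path are placeholders):

```python
# Sketch of the cleaning step: drop special tokens and out-of-vocabulary words,
# then discard documents that end up with no content at all.
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)  # placeholder path
SPECIAL_TOKENS = {"***LIST***", "***formula***", "***codice***"}      # placeholder examples

def clean_document(doc_sentences):
    """Return the document with special tokens / OOV words removed; [] if nothing is left."""
    cleaned = []
    for sent in doc_sentences:
        words = [w for w in sent.split() if w not in SPECIAL_TOKENS and w in w2v]
        if words:
            cleaned.append(" ".join(words))
    return cleaned

# documents that come back empty are removed from the corpus entirely:
# cleaned_docs = [c for c in (clean_document(d) for d in documents) if c]
```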
About how to obtain the BERT sentence embeddings, please refer to the "BERT sentence embedding" section in our README file.
Please let me know if you have any further questions.
Thanks for replying!
Hi beiweixiaoxu,
I used the same preprocessing code as in the original project. I think the main reason for the empty files after processing is words that are not in the word2vec dictionary.
About the RULES dataset, it is available upon request; I will send it to you privately.
Thanks for sharing your great work! If possible, could I please have the BERT sentence embedding files? It takes too long to generate all those files using the method described in the "BERT sentence embedding" section. Much appreciated!
Hi @lxing532, I have a similar question to @xiaonan6. Running bert-as-service on Wiki-727K takes too long (weeks), and the stored files would be enormous if we save each sentence tensor as a list of real numbers. In practice, are there any tricks for data preprocessing? It would be great if you could share the pre-trained model as well.
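For reference, this is roughly what I am doing now (my own workaround, not something from the repo): encoding with bert-as-service and saving each document's sentence vectors as a float16 .npy array instead of a text list of floats, which at 768 dimensions is about 1.5 KB per sentence:

```python
# My current approach (sketch): encode a document's sentences with bert-as-service
# and store them as one compact float16 .npy file per document.
import os
import numpy as np
from bert_serving.client import BertClient

bc = BertClient(ip="localhost")  # assumes bert-serving-server is already running

def embed_and_save(doc_id, sentences, out_dir="bert_emb"):
    os.makedirs(out_dir, exist_ok=True)
    vecs = bc.encode(sentences)                        # (num_sentences, 768) float32
    np.save(os.path.join(out_dir, f"{doc_id}.npy"),
            vecs.astype(np.float16))                   # ~1.5 KB per sentence on disk
```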