lxing532 / improve_topic_seg


about data #2

Open shaoyangxu opened 3 years ago

shaoyangxu commented 3 years ago

Hello, I have two questions:

  1. What do you mean by the code below?
     bound1 = 14900; bound2 = 2135; bound3 = 50; bound4 = 100; bound5 = 117; bound6 = 227;
  2. How can I get the files below? [screenshot]

shaoyangxu commented 3 years ago

Actually, I can guess that 50 is for WIKI-50, 100 for CITIES, 117 for ELEMENTS, and 227 for CLINICAL. However, the two numbers 14900 and 2135 are confusing to me. In particular, did you apply some preprocessing to the corpora with 14900 and 2135 documents respectively? I ask because I noticed 'cleaned' in the paths /ubc/cs/research/nlp/Linzi/seg/bert/bert_emb_train_cleaned.txt and /ubc/cs/research/nlp/Linzi/seg/bert/bert_emb_dev_cleaned.txt.

shaoyangxu commented 3 years ago

I have downloaded WikiSection and found that there are 16,192 English training examples (2,513 for disease + 13,679 for city), which is not the same as 14900.

lxing532 commented 3 years ago

Hi beiweixiaoxu,

Thanks for your interest!

About the size of the WikiSection data: we found that after a few preprocessing steps (e.g., removing special tokens and words that are missing from word2vec), some documents were left with no content, so we removed those documents.
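
For reference, a minimal sketch of this kind of filtering is shown below; it is not the repository's actual preprocessing script, and the special-token values and the nested-list document layout are assumptions for illustration:

```python
# Hypothetical sketch of the document filtering described above, NOT the
# repository's actual preprocessing script. The special-token values and the
# layout of `documents` are assumptions for illustration.
from gensim.models import KeyedVectors

SPECIAL_TOKENS = {"***LIST***", "***formula***", "***codice***"}  # assumed placeholders

def clean_documents(documents, w2v_path):
    """documents: list of docs, each doc a list of sentences (token lists)."""
    w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
    cleaned = []
    for doc in documents:
        kept = []
        for sentence in doc:
            tokens = [t for t in sentence if t not in SPECIAL_TOKENS and t in w2v]
            if tokens:          # drop sentences that became empty
                kept.append(tokens)
        if kept:                # drop documents that became empty
            cleaned.append(kept)
    return cleaned
```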

About how to obtain the BERT sentence embeddings, please refer to the "BERT sentence embedding" section of our README file.
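
If the README's method is bert-as-service (as mentioned later in this thread), the client side looks roughly like the sketch below; the exact model and server parameters are whatever the README specifies:

```python
# Rough sketch of encoding sentences with bert-as-service. It assumes a
# bert-serving-start server is already running with the BERT model used
# in this project; see the README for the exact setup.
from bert_serving.client import BertClient

def embed_sentences(sentences):
    """sentences: list of raw sentence strings -> array of shape [n_sentences, dim]."""
    bc = BertClient()            # connects to the running bert-serving-start server
    return bc.encode(sentences)  # one fixed-size vector per sentence
```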

Please let me know if you have any further questions.

shaoyangxu commented 3 years ago

Thanks for replying!

  1. Can you share the code for preprocessing the WikiSection data with me? I suppose the 'special tokens' you mention are not the same ones as those in wiki_utils.py. [screenshot]
  2. I have collected the corpora mentioned in your paper (WikiSection, Choi, WIKI-50, CITIES, ELEMENTS, CLINICAL, SECTION-ZH); however, I couldn't find RULES in "Hall of mirrors: Corporate philanthropy and strategic advocacy", so I would appreciate it if you could share the RULES corpus with me. Thanks again!

lxing532 commented 3 years ago

Hi beiweixiaoxu,

I used the same preprocessing code as in the original project. I think the main reason for the empty files after preprocessing is words that are not in the word2vec dictionary.

About the RULES dataset: it is available upon request. I will send it to you privately.

xiaonan6 commented 3 years ago

Thanks for sharing your great work! If possible, could I please have the BERT sentence embedding files? It has taken me too long to generate all those files using the method described in the "BERT sentence embedding" section. Much appreciated!

logan-siyao-peng commented 2 years ago

Hi @lxing532, I have a similar question to @xiaonan6. Running bert-as-service on Wiki-727K takes too long (weeks), and the stored files would be enormous if each sentence tensor is saved as a list of real numbers. In practice, are there any tricks for the data preprocessing? It would be great if you could share the pre-trained model as well.
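
One possible storage trick (just a sketch of a workaround, not the authors' pipeline) is to encode each document once and save its sentence embeddings as compressed half-precision NumPy arrays instead of plain-text lists of floats:

```python
# Sketch of a possible storage workaround (not the authors' pipeline):
# keep sentence embeddings as compressed float16 arrays, one file per document.
import numpy as np
from bert_serving.client import BertClient

def save_document_embeddings(doc_id, sentences, out_dir):
    bc = BertClient()                                 # running bert-serving-start server assumed
    emb = bc.encode(sentences).astype(np.float16)     # [n_sentences, dim] in half precision
    np.savez_compressed(f"{out_dir}/{doc_id}.npz", emb=emb)

def load_document_embeddings(path):
    return np.load(path)["emb"].astype(np.float32)    # restore float32 for the model
```

Half precision roughly halves the disk usage compared to float32, and compressed binary files avoid the large overhead of writing every number out as text.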