liwb1219 / zhfeat


Is the architecture in Figure 1 of the paper not fully implemented in this repository? #4

Open laosuan opened 1 year ago

laosuan commented 1 year ago

I looked around but could not find the code where the topic model and the BERT model jointly output the text difficulty level 😂

liwb1219 commented 1 year ago

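You first need to build the linguistic features. Extract the traditional features directly with lingfeat. For the pretrained topic model, run run_corex.py (download the wiki pretraining corpus first, then keep only the samples that meet the length requirement), then run inference with the pretrained topic model to obtain the difficulty-aware topic features. Together, these two parts make up the complete linguistic features.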

liwb1219 commented 1 year ago

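The code where "the topic model and the BERT model jointly output the text difficulty level" is in the src/model directory. There is only one model; besides BERT's three inputs, its forward pass also takes the linguistic features.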

laosuan commented 1 year ago

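> You first need to build the linguistic features. Extract the traditional features directly with lingfeat. For the pretrained topic model, run run_corex.py (download the wiki pretraining corpus first, then keep only the samples that meet the length requirement), then run inference with the pretrained topic model to obtain the difficulty-aware topic features. Together, these two parts make up the complete linguistic features.

How exactly are the traditional features and the topic features combined? Is there any code I can refer to?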

liwb1219 commented 1 year ago

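Just concatenate them. Reference code for extracting the traditional features is below (you can also look at LingFeat's GitHub directly). I did not use LingFeat's topic features; the topic features come from inference with the topic model trained by run_corex.py. For English they have 120 dimensions; added to the 207 dimensions here, the total is 327 dimensions.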

    from lingfeat import extractor


    def extract_artificial_features(text):
        """Return the 207 traditional LingFeat features for one text."""
        LingFeat = extractor.pass_text(text)
        LingFeat.preprocess()

        # Advanced Semantic Features (AdSem): LingFeat's own topic features, not used here
        # WoKF = LingFeat.WoKF_()  # Wikipedia Knowledge Features
        # WBKF = LingFeat.WBKF_()  # WeeBit Corpus Knowledge Features
        # OSKF = LingFeat.OSKF_()  # OneStopEng Corpus Knowledge Features

        # Discourse (Disco) Features
        EnDF = LingFeat.EnDF_()  # Entity Density Features
        EnGF = LingFeat.EnGF_()  # Entity Grid Features

        # Syntactic (Synta) Features
        PhrF = LingFeat.PhrF_()  # Noun/Verb/Adj/Adv/... Phrasal Features
        TrSF = LingFeat.TrSF_()  # (Parse) Tree Structural Features
        POSF = LingFeat.POSF_()  # Noun/Verb/Adj/Adv/... Part-of-Speech Features

        # Lexico Semantic (LxSem) Features
        TTRF = LingFeat.TTRF_()  # Type Token Ratio Features
        VarF = LingFeat.VarF_()  # Noun/Verb/Adj/Adv Variation Features
        PsyF = LingFeat.PsyF_()  # Psycholinguistic Difficulty of Words (AoA Kuperman)
        WoLF = LingFeat.WorF_()  # Word Familiarity from Frequency Count (SubtlexUS)

        # Shallow Traditional (ShTra) Features
        ShaF = LingFeat.ShaF_()  # Shallow Features (e.g. avg number of tokens)
        TraF = LingFeat.TraF_()  # Traditional Formulas

        # With the AdSem groups included there would be 255 features:
        # features_dict_list = [WoKF, WBKF, OSKF, EnDF, EnGF, PhrF, TrSF, POSF, TTRF, VarF, PsyF, WoLF, ShaF, TraF]
        features_dict_list = [EnDF, EnGF, PhrF, TrSF, POSF, TTRF, VarF, PsyF, WoLF, ShaF, TraF]

        # Flatten the per-group dicts into one list of feature values
        features_list = []
        for features_dict in features_dict_list:
            for k, v in features_dict.items():
                features_list.append(v)

        # assert len(features_list) == 255, 'Inconsistent number of features'
        assert len(features_list) == 207, 'Inconsistent number of features'

        return features_list
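The concatenation described in this reply (207 traditional features plus 120 topic features, 327 in total) might look like the sketch below. The function name `combine_features` and the ordering (traditional features first) are assumptions for illustration; the thread only says the two parts are concatenated, so check the repository for the order the model actually expects.

```python
from typing import List, Sequence

TRADITIONAL_DIM = 207  # LingFeat features kept in this thread (AdSem groups excluded)
TOPIC_DIM = 120        # English topic-feature size stated in this thread


def combine_features(traditional: Sequence[float], topic: Sequence[float]) -> List[float]:
    """Concatenate traditional and topic features into one 327-dim vector.

    Hypothetical helper: the order (traditional first) is an assumption.
    """
    if len(traditional) != TRADITIONAL_DIM:
        raise ValueError(f"expected {TRADITIONAL_DIM} traditional features, got {len(traditional)}")
    if len(topic) != TOPIC_DIM:
        raise ValueError(f"expected {TOPIC_DIM} topic features, got {len(topic)}")
    return list(traditional) + list(topic)


# Placeholder vectors stand in for real LingFeat and topic-model outputs.
combined = combine_features([0.0] * TRADITIONAL_DIM, [1.0] * TOPIC_DIM)
print(len(combined))  # 327
```

Checking both lengths before concatenating catches the case where the AdSem groups were left in by mistake (255 instead of 207).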

laosuan commented 1 year ago

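> You first need to build the linguistic features. Extract the traditional features directly with lingfeat. For the pretrained topic model, run run_corex.py (download the wiki pretraining corpus first, then keep only the samples that meet the length requirement), then run inference with the pretrained topic model to obtain the difficulty-aware topic features. Together, these two parts make up the complete linguistic features.

Could you share the download link for the wiki pretraining corpus? I found a wiki dataset on Hugging Face, https://huggingface.co/datasets/wikipedia, but it is tens of GB, which does not look like something you would read into memory all at once.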

liwb1219 commented 1 year ago

That is the data, just with a different dump date. You need to filter it by length (300-1000) and then merge everything into a single file.
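A minimal sketch of the filter-and-merge step, under stated assumptions: the thread does not say whether 300-1000 counts words or characters, so length here is measured in whitespace-separated tokens, and the one-article-per-line output format is also a guess. The names `text_length`, `filter_by_length`, and `merge_to_file` are hypothetical and not taken from the repository.

```python
from typing import Iterable, Iterator

MIN_LEN = 300   # lower bound from the thread
MAX_LEN = 1000  # upper bound from the thread


def text_length(text: str) -> int:
    """Length in whitespace-separated tokens (an assumption; the unit is not stated)."""
    return len(text.split())


def filter_by_length(texts: Iterable[str],
                     min_len: int = MIN_LEN,
                     max_len: int = MAX_LEN) -> Iterator[str]:
    """Yield only texts whose length falls inside [min_len, max_len]."""
    for text in texts:
        if min_len <= text_length(text) <= max_len:
            # Collapse internal newlines so each article fits on one line.
            yield " ".join(text.split())


def merge_to_file(texts: Iterable[str], out_path: str) -> int:
    """Write the filtered texts to a single file, one per line; return how many were kept."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for text in filter_by_length(texts):
            f.write(text + "\n")
            kept += 1
    return kept
```

Because both functions consume an iterable lazily, the articles can be streamed from disk one at a time instead of loading the whole dump into memory.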

laosuan commented 1 year ago

top - 19:03:13 up 33 days, 12:01,  2 users,  load average: 1.02, 1.06, 0.98
Tasks: 510 total,   3 running, 474 sleeping,  33 stopped,   0 zombie
%Cpu(s):  0.6 us,  9.2 sy,  0.0 ni, 89.6 id,  0.3 wa,  0.0 hi,  0.3 si,  0.0 st
MiB Mem :  96378.2 total,    806.4 free,  93780.0 used,   1791.8 buff/cache
MiB Swap:   2048.0 total,      0.0 free,   2048.0 used.   1412.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3389424 mike      20   0  167.5g  74.9g 154268 R 106.7  79.6  20:21.21 python

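When running run_corex.py, the machine ran out of memory and the program was killed. What hardware configuration is needed to run it through?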

liwb1219 commented 1 year ago

It depends on the size of your data.
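As a rough way to reason about how memory scales with data size, the back-of-envelope sketch below estimates the footprint of a dense document-by-word matrix. Whether run_corex.py actually builds a dense matrix is not stated in this thread, and the document count and vocabulary size used here are made-up placeholders, so treat this only as an illustration of the scaling.

```python
def dense_matrix_gib(n_docs: int, vocab_size: int, bytes_per_value: int = 8) -> float:
    """Memory of a dense n_docs x vocab_size matrix in GiB (8 bytes = float64)."""
    return n_docs * vocab_size * bytes_per_value / 2**30


# Hypothetical corpus sizes, not figures from the repository or the paper.
for n_docs in (100_000, 1_000_000):
    print(f"{n_docs} docs: {dense_matrix_gib(n_docs, vocab_size=20_000):.1f} GiB")
```

Since the estimate grows linearly with the number of documents, subsampling the filtered corpus is one straightforward way to fit a smaller machine.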