Open war24ever opened 1 year ago
Hello.
Could I see the contents of your `listOfEntities` variable?
Below is the relevant snippet from BIOtagging.py. I have not changed any code in this file. Or does it actually need some adjustment?
```python
import numpy as np


def convert_to_spaCyformat(df, listOfEntities):
    """
    This function is used to convert the BIO-tagged DF to spaCy format annotations.
    Args:
    - df (pandas.DataFrame) > BIO-tagged dataframe consisting of two columns, i.e. token and BIO_tag
    - listOfEntities (list) > list of entities/annotations used
    Return:
    - [text, enti] > a list consisting of the text (combined from the tokens) and the entities of interest, in accordance with the spaCy format
    """
    # check if NaN exists
    assert not (df.iloc[:,0].isnull().any() or df.iloc[:,1].isnull().any()), 'The dataset contains nan value.'

    # create a dictionary to save the columns of 'token' and 'BIO_tag', and we also define the index of tokens in order
    dictTemp = {}
    dictTemp['token'] = np.array(df.iloc[:,0])
    dictTemp['BIO_tag'] = np.array(df.iloc[:,1].str.lower())
    dictTemp['indices'] = np.array([len(i) for i in dictTemp['token']])

    # first, we need to get the index of the first token
    total_idx = [dictTemp['indices'][0]]
    temp = dictTemp['indices'][0]
    # then we use a for loop to accumulate the index of each token
    for i in range(len(dictTemp['indices'])):
        if i > 0:
            temp += dictTemp['indices'][i]
            total_idx.append(temp)

    # create variable for the start index of each token
    dictTemp['start_idx'] = np.array([total_idx[i-1] if i > 0 else 0 for i in range(len(total_idx))])
    # create variable for the last index of each token
    dictTemp['end_idx'] = np.array(total_idx)
    del dictTemp['indices']  # we no longer need variable indices. then remove it.

    enti = {}
    entities = []
    text = ''.join(dictTemp['token'])

    # combine each of listOfEntities with prefix 'b-', 'i-', and 'e-', and add 'o' annotation
    listOfEntities = ['b-'+i.lower() for i in listOfEntities] + \
                     ['i-'+i.lower() for i in listOfEntities] + \
                     ['e-'+i.lower() for i in listOfEntities] + ['o']

    # check if each BIO-tag is in listOfEntities
    error_tag = []
    error_boolean = []
    for i in np.unique(dictTemp['BIO_tag']):
        if i in listOfEntities:
            error_boolean.append(True)
        else:
            error_boolean.append(False)
            error_tag.append(i)
    assert all(error_boolean), "Some BIO-tag not listed in listOfEntities arg. {}".format(error_tag)

    # fill in entities list with all non 'O' annotations
    for row in range(len(dictTemp['token'])):
        if dictTemp['BIO_tag'][row] != 'o':
            entities.append((dictTemp['start_idx'][row],
                             dictTemp['end_idx'][row],
                             dictTemp['BIO_tag'][row]))

    start = []
    end = []
    BIO = []
    i = 0
    # merge consecutive b-/i-/e- tokens of the same entity type into one span
    # (string comparisons use == / != rather than `is`, which tests identity)
    while i < len(entities):
        try:
            if entities[i][2][2:] == entities[i+1][2][2:]:
                if entities[i][2][0] == 'b':
                    start.append(entities[i][0])
                    i += 1
                    if entities[i][2][0] == 'e':
                        end.append(entities[i][1])
                        BIO.append(entities[i][2][2:])
                        i += 1
                        continue
                    elif entities[i][2][0] == 'i':
                        for j in range(i, len(entities)):
                            if entities[j][2][0] != 'e' and j < len(entities)-1:
                                continue
                            elif entities[j][2][0] == 'e':
                                end.append(entities[j][1])
                                BIO.append(entities[j][2][2:])
                                i = j+1
                                break
                            else:
                                assert False, \
                                    "Something error in the BIO-tag you wrote. Error BIO tag: '{}'" \
                                    .format(entities[j][2])
                    elif entities[i][2][0] == 'b':
                        # the previous 'b' token was a single-token entity
                        end.append(entities[i-1][1])
                        BIO.append(entities[i-1][2][2:])
                        continue
            else:
                start.append(entities[i][0])
                end.append(entities[i][1])
                BIO.append(entities[i][2][2:])
                i += 1
        except IndexError:
            # last entity token: there is no entities[i+1] to compare with
            start.append(entities[i][0])
            end.append(entities[i][1])
            BIO.append(entities[i][2][2:])
            i += 1

    enti['entities'] = [(i,j,k) for i,j,k in zip(start, end, BIO)]
    return [text, enti]
```
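For anyone following along, the start/end offsets that the function builds with its loop are just running totals of the token lengths. Here is a minimal sketch of the same idea using `np.cumsum`. The tokens are made up for illustration, and I am assuming that whitespace appears as its own token, since the text is rebuilt with `''.join(...)`; this is not the code from BIOtagging.py.

```python
import numpy as np

# Hypothetical tokens (not from the attached CSV). Spaces are separate
# tokens so that "".join(tokens) reproduces the original text.
tokens = np.array(["Joko", " ", "Widodo", " ", "tiba"])

lengths = np.array([len(t) for t in tokens])

# Exclusive end offset of each token: running total of the lengths.
end_idx = np.cumsum(lengths)

# Each token starts where the previous one ended; the first starts at 0.
start_idx = np.concatenate(([0], end_idx[:-1]))

text = "".join(tokens)

# Slicing the text with the offsets should give back every token.
spans = [text[s:e] for s, e in zip(start_idx, end_idx)]
print(list(zip(start_idx.tolist(), end_idx.tolist(), spans)))
```

If a token in the CSV has stray leading or trailing characters, the slices still match the tokens, but the entity spans may not line up with what you expect in the joined text.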
As for the dataset that I tagged with entities manually, it is the one I attached in my first message.
OK. But could you please show me the contents of the variable `annotations`
(or `listOfEntities`)
used on this line: train_data = convert_to_spaCyformat(df_tagged, annotations)
?
I need to check whether the contents of that variable cover every entity used in the attached CSV file.
Here is the variable.
OK. Let me try to reproduce the error first. It may take a while, since I am working on other things at the same time.
Hello mas Reza, I am trying to scrape new news articles by replacing the news source website (kompas.com) with a different one. I then manually tagged the CSV file as described in the tutorial, but when running the code on the line
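A quick way to run that check yourself before calling `convert_to_spaCyformat` is sketched below. It mirrors the two conditions the function asserts on (no NaN values, and every tag covered by the entity list). The helper name `find_uncovered_tags` and the sample data are hypothetical, only there for illustration.

```python
import pandas as pd


def find_uncovered_tags(df, annotations):
    """Return the BIO tags in df's second column that `annotations` does not cover.

    Hypothetical helper: it expands each entity name with the 'b-', 'i-'
    and 'e-' prefixes and adds 'o', like the function in BIOtagging.py does.
    """
    allowed = {"o"}
    for name in annotations:
        for prefix in ("b-", "i-", "e-"):
            allowed.add(prefix + name.lower())
    used = set(df.iloc[:, 1].dropna().str.lower().unique())
    return sorted(used - allowed)


# Made-up example: 'LOC' is used in the data but missing from annotations.
df_tagged = pd.DataFrame({
    "token": ["Joko", " ", "Widodo", " ", "tiba", " ", "di", " ", "Jakarta"],
    "BIO_tag": ["B-PER", "O", "E-PER", "O", "O", "O", "O", "O", "B-LOC"],
})
annotations = ["PER"]

# True if either of the first two columns contains a NaN value.
has_nan = bool(df_tagged.iloc[:, :2].isnull().values.any())
missing = find_uncovered_tags(df_tagged, annotations)

print("contains NaN:", has_nan)
print("tags not covered:", missing)
```

An empty list and `False` would mean neither of those two assertions should fire for that data.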
train_data = convert_to_spaCyformat(df_tagged, annotations)
an error occurs, with the traceback shown below. `--------------------------------------------------------------------------- AssertionError Traceback (most recent call last)