utomoreza / spaCy-NER

This repo shows how to use Indonesian NER with spaCy
MIT License
17 stars 10 forks

Error BIO-tag #1

Open war24ever opened 1 year ago

war24ever commented 1 year ago

Hello Reza, I am trying to scrape new news articles by replacing the news source website address (originally kompas.com) with a different one. I then tagged the CSV file manually, following the tutorial, but running the line `train_data = convert_to_spaCyformat(df_tagged, annotations)` raises an error with the traceback below.

```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
in
----> 1 train_data = convert_to_spaCyformat(df_tagged, annotations)

~/anaconda3/envs/project17-spacyNER/spaCy-NER-master/BIOtagging.py in convert_to_spaCyformat(df, listOfEntities)
    194                             assert 1 == 0, \
    195                                 "Something error in the BIO-tag you wrote. Error BIO tag: '{}'" \
--> 196                                 .format(entities[j][2])
    197                 elif entities[i][2][0] is 'b':
    198                     # print('end1b', entities[i-1][1])

AssertionError: Something error in the BIO-tag you wrote. Error BIO tag: 'b-nama'
```

Here is the file with the manual entity tagging I did: [text_tagged_done.csv](https://github.com/utomoreza/spaCy-NER/files/10049686/text_tagged_done.csv) (a file with the same name was also attached inside the traceback: [text_tagged_done.csv](https://github.com/utomoreza/spaCy-NER/files/10049684/text_tagged_done.csv)). Please advise on this error. Thank you.

utomoreza commented 1 year ago

Hello. Could I see the contents of the listOfEntities variable?

war24ever commented 1 year ago

Below is an excerpt from BIOtagging.py. I have not changed any code in this file. Or does it actually need some adjustment?

```python
def convert_to_spaCyformat(df, listOfEntities):
    """
    This function is used to convert the BIO-tagged-DF to spaCy format annotations.

    Args:
    - df (pandas.DataFrame) > BIO-tagged dataframe consisting of two columns, i.e. token and BIO_tag
    - listOfEntities (list) > list of entities/annotations used

    Return:
    - [text, enti] > a list consisting of the text (combined from the tokens) and the interested entities as accordance with spaCy format
    """
    # check if NaN exists
    assert not (df.iloc[:,0].isnull().any() or df.iloc[:,1].isnull().any()), 'The dataset contains nan value.'

    # create a dictionary to save the columns of 'token' and 'BIO_tag', and we also define the index of tokens in order
    dictTemp = {}
    dictTemp['token'] = np.array(df.iloc[:,0])
    dictTemp['BIO_tag'] = np.array(df.iloc[:,1].str.lower())
    dictTemp['indices'] = np.array([len(i) for i in dictTemp['token']])

    # first, we need to get the index of the first token
    total_idx = [dictTemp['indices'][0]]
    temp = dictTemp['indices'][0]

    # then we use for loop to count index for each token in cumulative
    for i in range(len(dictTemp['indices'])):
        if i > 0:
            temp += dictTemp['indices'][i]
            total_idx.append(temp)

    # create variable for the start index of each token
    dictTemp['start_idx'] = np.array([total_idx[i-1] if i > 0 else 0 for i in range(len(total_idx))])

    # create variable for the last index of each token
    dictTemp['end_idx'] = np.array(total_idx)
    del dictTemp['indices'] # we no longer need variable indices. then remove it.

    enti = {}
    entities = []
    text = ''.join(dictTemp['token'])

    # combine each of listOfEntities with prefix 'b-', 'i-', and 'e-', and add 'o' annotation
    listOfEntities = ['b-'+i.lower() for i in listOfEntities] + \
                     ['i-'+i.lower() for i in listOfEntities] + \
                     ['e-'+i.lower() for i in listOfEntities] + ['o']

    # check if each BIO-tag is in listOfEntities
    error_tag = []
    error_boolean = []
    for i in np.unique(dictTemp['BIO_tag']):
        if i in listOfEntities:
            error_boolean.append(True)
        else:
            error_boolean.append(False)
            error_tag.append(i)
    assert all(error_boolean), "Some BIO-tag not listed in listOfEntities arg. {}".format(error_tag)

    # fill in entities list with all non 'O' annotations
    for row in range(len(dictTemp['token'])):
        if dictTemp['BIO_tag'][row] != 'o':
            entities.append((dictTemp['start_idx'][row],
                             dictTemp['end_idx'][row],
                             dictTemp['BIO_tag'][row]))

    start = []
    end = []
    BIO = []
    i = 0
    while i < len(entities):
        try:
            if entities[i][2][2:] == entities[i+1][2][2:]:
                if entities[i][2][0] is 'b':
                    # print('start1', entities[i][0])
                    start.append(entities[i][0])
                    i += 1
                    if entities[i][2][0] is 'e':
                        # print('end1a', entities[i][1])
                        end.append(entities[i][1])
                        BIO.append(entities[i][2][2:])
                        i += 1
                        continue
                    elif entities[i][2][0] is 'i':
                        for j in range(i, len(entities)):
                            if entities[j][2][0] is not 'e' and j < len(entities)-1:
                                # print('sana', entities[j])
                                continue
                            elif entities[j][2][0] is 'e':
                                # print('end1b', entities[j][1])
                                end.append(entities[j][1])
                                BIO.append(entities[j][2][2:])
                                i = j+1
                                break
                            else:
                                assert 1 == 0, \
                                    "Something error in the BIO-tag you wrote. Error BIO tag: '{}'" \
                                    .format(entities[j][2])
                    elif entities[i][2][0] is 'b':
                        # print('end1b', entities[i-1][1])
                        end.append(entities[i-1][1])
                        BIO.append(entities[i-1][2][2:])
                        continue
                    # print('ss',i,j)
            else:
                # print('start2a', entities[i][0], i)
                start.append(entities[i][0])
                # print('end2a', entities[i][1], i)
                end.append(entities[i][1])
                BIO.append(entities[i][2][2:])
                i += 1
        except IndexError:
            # print('start2b', entities[i][0], i)
            start.append(entities[i][0])
            # print('end2b', entities[i][1], i)
            end.append(entities[i][1])
            BIO.append(entities[i][2][2:])
            i += 1

    enti['entities'] = [(i,j,k) for i,j,k in zip(start, end, BIO)]
    return [text, enti]
```

As for the dataset I tagged manually with entities, it is the file I attached in my first comment.
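Reading the quoted loop, the final `else` branch (the failing assertion) appears to be reached when a `b-` tag is followed by an `i-` tag of the same entity and no `e-` tag shows up anywhere before the end of the list. In that case the tag printed in the message is the last non-`o` tag in the file, not necessarily the row that is actually wrong. The sketch below is one possible way to locate such rows before calling `convert_to_spaCyformat`. The helper name `find_unclosed_spans` and the sample tags are hypothetical and not part of the repo, and it assumes that multi-token entities are meant to be written as `b-X`, `i-X` ..., `e-X`, which is what the prefixes in the quoted code suggest.

```python
def find_unclosed_spans(tags):
    """Return (first_row, last_row, label) for every b-/i- run with no closing e- tag.

    Assumption (not confirmed by the repo author): a multi-token entity is
    tagged b-X, i-X ..., e-X, and a lone b-X is a valid single-token entity.
    `tags` is the BIO_tag column in row order.
    """
    problems = []
    start = None        # row of the b- tag that opened the current entity
    label = None        # label of the currently open entity
    has_inside = False  # True once an i- tag has followed the opening b-

    # A trailing "o" sentinel flushes an entity that is still open at the end.
    for row, raw in enumerate(list(tags) + ["o"]):
        tag = str(raw).strip().lower()  # unlike the quoted code, whitespace is stripped here
        prefix, _, name = tag.partition("-")

        if start is not None and name == label and prefix in ("i", "e"):
            if prefix == "i":
                has_inside = True
            else:  # an e- tag closes the open entity
                start, label, has_inside = None, None, False
            continue

        # The current tag does not continue the open entity.
        if start is not None and has_inside:
            problems.append((start, row - 1, label))
        if prefix == "b":
            start, label, has_inside = row, name, False
        else:
            start, label, has_inside = None, None, False

    return problems


# Hypothetical tag column, for illustration only.
sample_tags = ["o", "b-nama", "i-nama", "e-nama", "o",
               "b-nama", "i-nama", "o", "b-nama"]
print(find_unclosed_spans(sample_tags))  # [(5, 6, 'nama')]
```

With a real file, the second column of the tagged DataFrame (for example `df_tagged.iloc[:, 1]`) could be passed in place of `sample_tags`; the row numbers returned are zero-based positions in that column.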

utomoreza commented 1 year ago

OK. But could you please show me the contents of the annotations variable (passed as listOfEntities) on this line: `train_data = convert_to_spaCyformat(df_tagged, annotations)`? I need to check whether that variable covers all the entities used in the attached CSV file.
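The coverage check described here can also be run by hand. The following is a minimal sketch that mirrors the assertion in the quoted function: every tag has to be `o` or an annotation name prefixed with `b-`, `i-` or `e-`, compared in lower case. The helper name `find_uncovered_tags` and all sample values except `nama` are hypothetical.

```python
def find_uncovered_tags(tags, annotations):
    """Return the sorted tags that cannot be built from `annotations`.

    Mirrors the check in the quoted convert_to_spaCyformat: tags are only
    lower-cased (not stripped), so stray whitespace counts as a mismatch.
    """
    allowed = {"o"}
    for name in annotations:
        for prefix in ("b-", "i-", "e-"):
            allowed.add(prefix + name.lower())

    seen = {str(tag).lower() for tag in tags}
    return sorted(seen - allowed)


# Hypothetical values, for illustration only.
annotations = ["nama", "lokasi"]
tags = ["O", "B-nama", "e-nama", "b-organisasi", "i-lokasi "]
print(find_uncovered_tags(tags, annotations))  # ['b-organisasi', 'i-lokasi ']
```

An empty result means every tag in the column is covered. Note that the traceback in this thread comes from the later assertion inside the `while` loop, not from this coverage check, so a clean result here would not by itself rule out a problem in the tag sequence.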

war24ever commented 1 year ago

Here is the variable (attached as an image).

utomoreza commented 1 year ago

OK. Let me try to reproduce the error first. It may take a while, since I am working on other things at the same time.