nert-nlp / Xposition

Cataloging the semantics of adpositions (prepositions, postpositions) and case markers in multiple languages.
GNU General Public License v3.0

Import Streusle 4.3 #194

Closed: ablodge closed this issue 4 years ago

ablodge commented 4 years ago

The script for converting conllulex to Django model files is here:

https://github.com/nert-nlp/Xposition/blob/master/scripts/models_for_import.py

Models can then be imported through the admin interface, assuming django-import-export is installed.

@nschneid If you take a look at the script, you'll see that it's pretty hardcoded for the data from a few years ago. Can you confirm that the new data hasn't changed in terms of format?

ablodge commented 4 years ago

To set up STREUSLE json, download the repo and run:

python conllulex2json.py streusle.conllulex > streusle.json
python govobj.py streusle.json > streusle.go.json

*On Windows, you may have to convert the file to UTF-8 by hand.
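If the JSON comes out mis-encoded on Windows, a one-off re-encoding pass does the trick. A minimal sketch, assuming the file was written in cp1252 (the function name and encodings are mine, not part of the repo):

```python
# Hedged sketch: re-encode a file written with the Windows default codepage
# (assumed cp1252 here) to UTF-8. The function name and default encoding are
# illustrative, not part of the STREUSLE scripts.
def reencode_to_utf8(path, src_encoding="cp1252"):
    with open(path, "r", encoding=src_encoding) as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
```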

@nschneid What is the significance of "notes" in streusle.go.notes.json?

nschneid commented 4 years ago

Oh hmm. As I recall there were some per-token or per-sentence annotator comments from the original annotation that aren't part of the official STREUSLE release but that we include in Xposition because they can help explain what's going on.

nschneid commented 4 years ago

e.g. http://flat.nert.georgetown.edu/ex/4973/ in the ℹ column

ablodge commented 4 years ago

Is there a script that adds notes to the json? I don't see that field in the conllulex or json files.

nschneid commented 4 years ago

Your README in the scripts directory mentions annotator_notes.py. Was that a script you wrote?

nschneid commented 4 years ago

BTW I think there are sentence-level notes, called note in the corpus sentence model, and token-level notes, called annotator_cluster.

nschneid commented 4 years ago

I forwarded you an email that explains the script/CSV files are in the https://github.com/nert-nlp/streusle/tree/prepare-4.0 branch

ablodge commented 4 years ago

I don't see annotator_notes.py in either repo, but it should be pretty easy to reverse engineer if you have the original file with notes.

ablodge commented 4 years ago

annotator_notes.py references files 'prepv-tokens.csv', 'psst-tokens-revisions.csv', 'allbacktick-tokens-revisions.csv', and 'current-psst_20150830.sentnotes.csv'. I can see that they are all in the prepare branch. Are they still relevant and up to date for the new corpus release?

nschneid commented 4 years ago

Ah. I forgot the system for collecting notes was so ad hoc. I will have to look to see if there are any spreadsheets with more recent notes.

nschneid commented 4 years ago

I can't find any annotator notes for updates executed on STREUSLE 4.1, 4.2, or 4.3. It looks like some of the spreadsheets for updates have color-coding but no textual comments on the decisions.

The only notes I could find were in https://github.com/ryanamannion/streusle/blob/ssupdate/needssupersense.xlsx, and some of these changes may have made their way into 4.2, but AFAICT it was on a case-by-case basis.

So please just use the notes as of the 4.0 release.

(In the future, we should consider a systematic workflow for changes that will allow us to record these notes, perhaps as comments in streusle.conllulex.)

ablodge commented 4 years ago

Note: When importing Supersenses, the importer expects an article for each supersense to exist already.

nschneid commented 4 years ago

Hmm. Is there a way to just import the corpus examples? We already have the supersenses set up.

ablodge commented 4 years ago

Yes. I'm just working through the steps in https://github.com/nert-nlp/Xposition/tree/master/scripts to load models from scratch without the current database.

ablodge commented 4 years ago

(screenshot of an encoding error)

Another unexpected problem. The file is encoded in UTF-8; the error is probably due to curly quotes, such as “ no reservation , sign~ your ~name here ”, taken from the corpus.

For now, I'll fix this using the unidecode package.
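unidecode transliterates the offending characters to ASCII. For readers without that dependency, a minimal stand-in using only the standard library (the translation table and function name are mine):

```python
# Dependency-free stand-in for the unidecode fix: map common "smart"
# punctuation to ASCII before writing import files. (The actual fix uses the
# unidecode package, which transliterates far more than this table.)
SMART_PUNCT = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})

def asciify(text: str) -> str:
    return text.translate(SMART_PUNCT)
```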

ablodge commented 4 years ago

While trying to import corpus sentences, I get this error:

AttributeError at /admin/metadata/corpussentence/import/: module 'tablib.formats._xls' has no attribute 'title', raised in C:\Users\austi\Anaconda3\lib\site-packages\import_export\formats\base_formats.py, in get_title: return self.get_format().title.

Apparently this is related to the tablib version. It can be fixed with pip install tablib==0.14.0.

ablodge commented 4 years ago

The error InvalidDimensions encountered while trying to read file: corpus_sentences.tsv is displayed with no stack trace when trying to import corpus sentences.

Note: the first 500 sentences work fine.

Addendum: This error was because of unescaped quote characters. Using a csv writer fixes the problem.
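A sketch of the csv-writer fix, with made-up column names and sample data:

```python
import csv

# Sketch of the fix: write the TSV with csv.writer so embedded quote
# characters are escaped properly, instead of joining fields with '\t' by
# hand. Column names and the sample row are made up for illustration.
rows = [["reviews-001", 'He said "no reservations" at the door .']]
with open("corpus_sentences.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["sent_id", "text"])
    writer.writerows(rows)
```

With QUOTE_MINIMAL, only fields containing a quote, delimiter, or newline get quoted, so the file stays readable while remaining parseable.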

ablodge commented 4 years ago

Related: does the class CorpusSentenceResource need to specify a corpus version? It seems to be missing, which may cause problems later.

ablodge commented 4 years ago

When unpacking the Xposition database file with sqlite, I get the error:

Error: near line 3: file is not a database
Error: near line 9: file is not a database
Error: near line 10: file is not a database
Error: near line 11: file is not a database
...

nschneid commented 4 years ago

Hmm, maybe .read requires a valid .db file to exist already, whereas the sqlite3 x.db < x.sql option does not.
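The redirect route can be mimicked from Python with executescript, which is handy for sanity-checking a dump; the schema below is a placeholder, not the real Xposition database:

```python
import sqlite3

# Sketch of the working route: execute the SQL dump against a fresh database
# (equivalent to sqlite3 new.db < dump.sql on the command line). The dump
# contents here are placeholders, not the real Xposition schema.
dump = """
CREATE TABLE metadata_corpussentence (sent_id TEXT, text TEXT);
INSERT INTO metadata_corpussentence VALUES ('reviews-001', 'Hello .');
"""
conn = sqlite3.connect(":memory:")
conn.executescript(dump)
n = conn.execute("SELECT COUNT(*) FROM metadata_corpussentence").fetchone()[0]
```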

ablodge commented 4 years ago

Alright, the import is almost working. I can convert STREUSLE data to JSON and TSV files, and I can import corpus sentences for different versions of the same corpus, but a few issues remain:

nschneid commented 4 years ago

> New corpus sentences need to have different ids than the ones in the database. This is because annotations are associated with a sentence, which is identified uniquely by an id, but different corpus versions can have different annotations. To import two sentences and have them stored as separate objects, they need different ids.

I'm not sure why they would need different sent_ids: the CorpusSentence model specifies

unique_together = [('corpus', 'sent_id')]

which says only that each sent_id within a corpus instance has to be unique.

Note that each instance of a model has an automatic unique ID assigned (I think it's called .pk for primary key), and this is what's used internally for database joins.
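The constraint can be seen directly at the SQL level. A sketch with a simplified table (names are illustrative):

```python
import sqlite3

# Sketch of what unique_together = [('corpus', 'sent_id')] means at the
# database level: sent_id may repeat across corpora as long as each
# (corpus_id, sent_id) pair is unique, and every row still gets its own
# auto-assigned primary key (Django's .pk). Table layout is simplified.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE corpussentence (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- the automatic .pk
        corpus_id INTEGER NOT NULL,
        sent_id TEXT NOT NULL,
        UNIQUE (corpus_id, sent_id)
    )""")
# The same sent_id in two different corpus versions is allowed:
conn.execute("INSERT INTO corpussentence (corpus_id, sent_id) VALUES (1, 'reviews-001')")
conn.execute("INSERT INTO corpussentence (corpus_id, sent_id) VALUES (2, 'reviews-001')")
```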

ablodge commented 4 years ago

In that case, there is a problem with the importer. Whenever a new corpus sentence has the same sent_id as an existing sentence, the importer overwrites the data in the existing sentence instead of creating a new object, even if the corpus versions are different.

nschneid commented 4 years ago

Maybe due to how admin.py is configured? Could be related to https://github.com/nert-nlp/Xposition/blob/5168edd9664ecc7683a75322695e273dd2a561a2/src/wiki/plugins/metadata/admin.py#L196 though I don't know exactly what import_id_fields does.

nschneid commented 4 years ago

(Maybe import_id_fields = ('corpus', 'sent_id') would do the trick, as these are the fields that uniquely define the CorpusSentence instances?)
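The effect of the choice of import_id_fields can be sketched with a toy upsert (this illustrates the matching logic only; it is not django-import-export's actual code):

```python
# Toy model of how django-import-export decides update-vs-create:
# import_id_fields names the columns used to match incoming rows to existing
# ones. Keying on sent_id alone merges sentences across corpus versions;
# keying on (corpus, sent_id) keeps them separate. Illustrative only.
def import_rows(table, rows, import_id_fields):
    for row in rows:
        key = tuple(row[f] for f in import_id_fields)
        table[key] = row  # update if the key exists, otherwise create
    return table

rows = [
    {"corpus": "streusle-4.0", "sent_id": "reviews-001", "text": "old"},
    {"corpus": "streusle-4.3", "sent_id": "reviews-001", "text": "new"},
]
by_sent_id = import_rows({}, rows, ("sent_id",))        # collapses to 1 row
by_pair = import_rows({}, rows, ("corpus", "sent_id"))  # keeps 2 rows
```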

ablodge commented 4 years ago

When importing PTokenAnnotations I get the following error: (screenshot)

ablodge commented 4 years ago

This is strange because (1) the code does check to see if the sentence is unique (here), and (2) PTokenAnnotation has no feature called sentence_id. It's just called sentence. So I'm wondering how this error was generated.

nschneid commented 4 years ago

Unique constraints are enforced at the database level, so sentence_id may be the underlying name in the database. Look at the database table schemas to see where this might be coming from and then figure out if the model is correct.

ablodge commented 4 years ago

The model does use sentence_id as an attribute in the database.

CREATE INDEX "metadata_ptokenannotation_adposition_id_0858ff6a" ON "metadata_ptokenannotation" ("adposition_id");
CREATE INDEX "metadata_ptokenannotation_construal_id_0cc05c85" ON "metadata_ptokenannotation" ("construal_id");
CREATE INDEX "metadata_ptokenannotation_sentence_id_b92bfeeb" ON "metadata_ptokenannotation" ("sentence_id");
CREATE INDEX "metadata_ptokenannotation_usage_id_eaff3eb9" ON "metadata_ptokenannotation" ("usage_id");
CREATE UNIQUE INDEX "metadata_ptokenannotation_sentence_id_token_indices_d911a308_uniq" ON "metadata_ptokenannotation" ("sentence_id", "token_indices");

ablodge commented 4 years ago

I'm checking to see if the 4.3 data is internally consistent (if every annotation has a unique sent_id, token_indices pair). If not, then that's easier to fix. It would also explain why importing sentences didn't cause a unique constraint error but importing annotations did.
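The check itself is a few lines. A sketch with made-up records (the real field names in the JSON may differ):

```python
from collections import Counter

# Sketch of the consistency check: every annotation should have a unique
# (sent_id, token_indices) pair, since the database enforces a unique index
# on those columns. The record layout below is made up for illustration.
def find_duplicates(annotations):
    counts = Counter((a["sent_id"], a["token_indices"]) for a in annotations)
    return [pair for pair, n in counts.items() if n > 1]

annotations = [
    {"sent_id": "reviews-001", "token_indices": "3:4"},
    {"sent_id": "reviews-001", "token_indices": "7:8"},
    {"sent_id": "reviews-001", "token_indices": "3:4"},  # duplicate pair
]
dups = find_duplicates(annotations)
```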

ablodge commented 4 years ago

All the different steps work! I'll make a pull request and review the changes.