Closed ablodge closed 4 years ago
To set up the STREUSLE JSON, download the repo and run:

```
python conllulex2json.py streusle.conllulex > streusle.json
python govobj.py streusle.json > streusle.go.json
```
*On Windows, you may have to convert the file to UTF-8 by hand.
@nschneid What is the significance of "notes" in streusle.go.notes.json?
Oh hmm. As I recall there were some per-token or per-sentence annotator comments from the original annotation that aren't part of the official STREUSLE release but that we include in Xposition because they can help explain what's going on.
e.g. http://flat.nert.georgetown.edu/ex/4973/ in the ℹ column
Is there a script that adds notes to the json? I don't see that field in the conllulex or json files.
Your README in the scripts directory mentions annotator_notes.py. Was that a script you wrote?
BTW I think there are sentence-level notes, called `note` in the corpus sentence model, and token-level notes, called `annotator_cluster`.
I forwarded you an email that explains the script/CSV files are in the https://github.com/nert-nlp/streusle/tree/prepare-4.0 branch
I don't see `annotator_notes.py` in either repo, but it should be pretty easy to reverse engineer if you have the original file with notes.
`annotator_notes.py` references the files `prepv-tokens.csv`, `psst-tokens-revisions.csv`, `allbacktick-tokens-revisions.csv`, and `current-psst_20150830.sentnotes.csv`. I can see that they are all in the prepare branch. Are they still relevant and up to date for the new corpus release?
Ah. I forgot the system for collecting notes was so ad hoc. I will have to look to see if there are any spreadsheets with more recent notes.
I can't find any annotator notes for updates executed on STREUSLE 4.1, 4.2, or 4.3. It looks like some of the spreadsheets for updates have color-coding but no textual comments on the decisions.
The only notes I could find were in https://github.com/ryanamannion/streusle/blob/ssupdate/needssupersense.xlsx, and some of these changes may have made their way into 4.2, but AFAICT it was on a case-by-case basis.
So please just use the notes as of the 4.0 release.
(In the future, we should consider a systematic workflow for changes that will allow us to record these notes, perhaps as comments in streusle.conllulex.)
Note: When importing Supersenses, the importer expects an article for each supersense to exist already.
Hmm. Is there a way to just import the corpus examples? We already have the supersenses set up.
Yes. I'm just working through the steps in https://github.com/nert-nlp/Xposition/tree/master/scripts to load models from scratch without the current database.
Another unexpected problem: the data is encoded in UTF-8, which triggers an error. It is probably due to curly quotes, such as “ no reservation , sign~ your ~name here ” taken from the corpus.
For now, I'll fix this using the unidecode package.
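As a stdlib illustration of the kind of normalization unidecode performs for this particular case (the translation table below is my own and covers only the punctuation mentioned above; unidecode itself handles far more characters):

```python
# Map curly quotes and dashes to their ASCII counterparts.
# This is only a sketch of what unidecode does for these characters.
CURLY_TO_ASCII = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})

def asciify_quotes(text):
    return text.translate(CURLY_TO_ASCII)
```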
While trying to import corpus sentences, I get:

```
AttributeError at /admin/metadata/corpussentence/import/
module 'tablib.formats._xls' has no attribute 'title'
in C:\Users\austi\Anaconda3\lib\site-packages\import_export\formats\base_formats.py, in get_title: return self.get_format().title
```
Apparently, this is related to the tablib version. It can be fixed with `pip install tablib==0.14.0`.
An error, `InvalidDimensions encountered while trying to read file: corpus_sentences.tsv`, is displayed with no stack trace when trying to import corpus sentences.
Note: the first 500 sentences work fine.
Addendum: This error was because of unescaped quote characters. Using a csv writer fixes the problem.
Related: does the class `CorpusSentenceResource` need to specify a corpus version? It seems to be missing, which may cause problems later.
When unpacking the Xposition database file with sqlite, I get the error:

```
Error: near line 3: file is not a database
Error: near line 9: file is not a database
Error: near line 10: file is not a database
Error: near line 11: file is not a database
...
```
Hmm, maybe `.read` requires a valid .db file to exist already, whereas the `sqlite3 x.db < x.sql` option does not.
Alright, the import is almost working, but there are a few issues. I can convert streusle data to json and tsv files, and I can import corpus sentences for different versions of the same corpus, but:
New corpus sentences need to have different ids than the ones in the database. This is because annotations are associated with a sentence which is identified uniquely by an id, but different corpus versions can have different annotations. To import two sentences and have them stored as separate objects, they need different ids.
I'm not sure why they would need different `sent_id`s: the `CorpusSentence` model specifies

`unique_together = [('corpus', 'sent_id')]`

which says only that each `sent_id` within a corpus instance has to be unique.
Note that each instance of a model has an automatic unique ID assigned (I think it's called `.pk` for primary key), and this is what's used internally for database joins.
In that case, there is a problem with the importer. Whenever importing a new corpus sentence that has the same sent_id as an existing sentence, the importer overwrites the data in the existing sentence instead of creating a new object, and this is true even if the corpus versions are different.
Maybe due to how admin.py is configured? Could be related to https://github.com/nert-nlp/Xposition/blob/5168edd9664ecc7683a75322695e273dd2a561a2/src/wiki/plugins/metadata/admin.py#L196 though I don't know exactly what `import_id_fields` does.

(Maybe `import_id_fields = ('corpus', 'sent_id')` would do the trick, as these are the fields that uniquely define the `CorpusSentence` instances?)
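A sketch of what that change might look like, assuming django-import-export's `ModelResource` API (the import path for the model is a guess):

```python
from import_export import resources
from metadata.models import CorpusSentence  # app path is an assumption

class CorpusSentenceResource(resources.ModelResource):
    class Meta:
        model = CorpusSentence
        # Match existing rows on (corpus, sent_id) instead of a single
        # field, so the same sent_id in a different corpus version
        # creates a new row rather than overwriting the old one.
        import_id_fields = ('corpus', 'sent_id')
```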
When importing PTokenAnnotations I get the following:
This is strange because (1) the code does check to see if the sentence is unique (here), and (2) `PTokenAnnotation` has no feature called `sentence_id`; it's just called `sentence`. So I'm wondering how this error was generated.
Unique constraints are enforced at the database level, so `sentence_id` may be the underlying name in the database. Look at the database table schemas to see where this might be coming from, and then figure out if the model is correct.
The model does use `sentence_id` as an attribute in the database:

```sql
CREATE INDEX "metadata_ptokenannotation_adposition_id_0858ff6a" ON "metadata_ptokenannotation" ("adposition_id");
CREATE INDEX "metadata_ptokenannotation_construal_id_0cc05c85" ON "metadata_ptokenannotation" ("construal_id");
CREATE INDEX "metadata_ptokenannotation_sentence_id_b92bfeeb" ON "metadata_ptokenannotation" ("sentence_id");
CREATE INDEX "metadata_ptokenannotation_usage_id_eaff3eb9" ON "metadata_ptokenannotation" ("usage_id");
CREATE UNIQUE INDEX "metadata_ptokenannotation_sentence_id_token_indices_d911a308_uniq" ON "metadata_ptokenannotation" ("sentence_id", "token_indices");
```
I'm checking to see if the 4.3 data is internally consistent (i.e., whether every annotation has a unique (sent_id, token_indices) pair). If not, then that's easier to fix. It would also explain why importing sentences didn't cause a unique constraint error but importing annotations did.
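A minimal sketch of that consistency check (the helper name and input format are assumptions; it takes (sent_id, token_indices) pairs and reports any that occur more than once, mirroring the unique index above):

```python
from collections import Counter

def duplicate_annotations(pairs):
    """Return the (sent_id, token_indices) pairs that appear more
    than once; the database's unique index requires there be none."""
    counts = Counter(pairs)
    return [pair for pair, n in counts.items() if n > 1]
```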
All the different steps work! I'll make a pull request and review the changes.
The script for converting conllulex to Django model files is here:
https://github.com/nert-nlp/Xposition/blob/master/scripts/models_for_import.py
Models can then be imported in admin mode assuming django-import-export is installed.
@nschneid If you take a look at the script, you'll see that it's pretty hardcoded for the data from a few years ago. Can you confirm that the new data hasn't changed in terms of format?