Hi @saattrupdan, I've been following your research on multilingual models and happened to come across this repo. I've been thinking about a similar approach for automatically translating QA datasets, so I was very happy to see it. I was wondering if you would be interested in some sort of collaboration, or would be open to PRs, for instance creating dataset wrappers for export to the Huggingface datasets library?
Hey @MarkusSagen,
Thanks for your interest in the repo! I'd be very happy to collaborate, and I'm also open to PRs. Are you thinking of anything specific?
The current matching algorithm is still very much a work in progress, as it makes several silly mistakes, and @vesteinn and I are working on improving (or replacing) it.
As for exporting to Huggingface, this should be quite simple: the files are all JSONL files, so they can be loaded as a HF Dataset simply with `Dataset.from_json("path/to/file")`.
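For example, something along these lines should be all that's needed (the path is just a placeholder for one of the generated JSONL files):

```python
from datasets import Dataset

# Load one of the translated JSONL files as a Hugging Face Dataset.
dataset = Dataset.from_json("path/to/file.jsonl")

# Inspect the result.
print(dataset)
print(dataset[0])

# Optionally, push it to the Hugging Face Hub
# (requires being logged in via `huggingface-cli login`).
# dataset.push_to_hub("username/dataset-name")
```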
Great to hear! I started testing it out for Swedish and Spanish, and I'm very open to helping out in whichever way I can. At first I was going to propose translating several datasets, but as you said, I've started to see some irregularities with the answer text spans. Happy to help and contribute if there is anything I can do.
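For reference, a minimal check along these lines (assuming the translated files keep the SQuAD-style schema, where `answers` holds parallel `text` and `answer_start` lists with character offsets) would flag the spans that don't line up:

```python
import json

def check_answer_spans(path):
    """Print every example whose stated answer span doesn't match its context."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            example = json.loads(line)
            context = example["context"]
            answers = example["answers"]
            for text, start in zip(answers["text"], answers["answer_start"]):
                found = context[start:start + len(text)]
                if found != text:
                    print(f"Line {line_no}: expected {text!r}, found {found!r}")

check_answer_spans("path/to/file.jsonl")
```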
@MarkusSagen Yeah I've tried translating SQuAD v2 to Spanish, French, German, Danish and Icelandic, resulting in these QA models:
They work reasonably well, but it would be nice to ensure a higher quality. We were considering a translation approach where tags marking the answer are inserted into the context before translation, so that the answer span can be recovered directly from the translated context (or maybe a combination of that and the current approach). We're both super busy though. Is that something you'd feel comfortable working on? If so, we could set up a meeting where we discuss more concretely what we're thinking :)
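Roughly, the idea would be something like the following sketch. The marker tokens are just an example, and `translate` is a stand-in for whatever MT system ends up being used; whether the markers survive translation intact is exactly the part that needs experimenting:

```python
MARKER_START = "<a>"
MARKER_END = "</a>"

def mark_answer(context: str, answer: str, answer_start: int) -> str:
    """Wrap the answer span in marker tokens before translating the context."""
    answer_end = answer_start + len(answer)
    return (
        context[:answer_start]
        + MARKER_START + context[answer_start:answer_end] + MARKER_END
        + context[answer_end:]
    )

def recover_answer(translated_context: str):
    """Locate the marked answer in the translated context.

    Returns (answer_text, answer_start, clean_context), or None if the
    markers did not survive translation.
    """
    start = translated_context.find(MARKER_START)
    end = translated_context.find(MARKER_END)
    if start == -1 or end == -1 or end < start:
        return None
    answer = translated_context[start + len(MARKER_START):end]
    clean_context = (
        translated_context[:start]
        + answer
        + translated_context[end + len(MARKER_END):]
    )
    return answer, start, clean_context

# Example flow, with `translate` standing in for the MT system:
# marked = mark_answer(example["context"], answer_text, answer_start)
# result = recover_answer(translate(marked, target_lang="sv"))
```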
@saattrupdan Absolutely! That sounds great :) Maybe it's easiest to continue this conversation by email then?