migaku-official / migaku-browser-extension-issue-tracker

An issue tracker for bugs and other issues related to the Migaku browser extension.

[French] Parser Not Splitting Contractions #6

Open cofinley opened 2 years ago

cofinley commented 2 years ago

Describe the bug The browser extension parses contractions as whole words instead of splitting them into the two words being contracted (I can only speak for French here).

This leaves me with a lot of overlapping words that I need to mark as 'known'. For instance, I know the words le and organization, but the extension treats their contraction, l'organization, as an entirely different word.

Examples:

This seems to happen with just about every contraction (examples of French contractions).

To Reproduce Steps to reproduce the behavior:

  1. Make sure French is used for the browser extension
  2. Use a French dictionary (can be bi/monolingual)
  3. Go to https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-fran%C3%A7aise/journal-en-francais-facile-02122021-20h00-gmt
    • Really any French site/subtitles will demonstrate this
  4. Click 'parse'
  5. Find words that start with l'
    • E.g. L'Organization, l'infrastructure, etc.
    • Can also search for c', d', qu' contractions, etc.
  6. Shift-hover on the word itself (e.g. organization), not on the preceding article (e.g. l'), to pull up the popup

Expected behavior The hovered-on word should be the search query (e.g. organization).

Actual behavior The whole word (contraction included) is used for the search query.

Screenshots chrome_2021-12-04_17-35-19, chrome_2021-12-04_17-35-52, chrome_2021-12-04_17-36-15, chrome_2021-12-04_17-36-55, chrome_2021-12-04_17-42-33, chrome_2021-12-04_17-44-29


KieranBrannigan commented 2 years ago

Thank you for your detailed bug report.

We will try to work on this as soon as possible once we finish work on other releases.

LucasMIA commented 2 years ago

We are finally getting around to checking your bug out. Sorry for the long wait.

cofinley commented 2 years ago

No worries! I know y’all have a lot on your plate.

I think (hope) this is more of a regex issue than a lexical/grammar parsing issue.
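A minimal sketch of what a regex-only approach might look like, written in Python purely for illustration. This is not Migaku's parser: the function name `split_elision` and the list of elided prefixes are assumptions, and the prefix list is not exhaustive. It accepts both the straight (') and the curly (’) apostrophe.

```python
import re
from typing import List

# Assumed, non-exhaustive list of French elided prefixes.
# Longer prefixes come first so "jusqu'" is not cut short.
ELISION = re.compile(
    r"^(?P<clitic>jusqu|lorsqu|puisqu|quoiqu|qu|[cdjlmnst])['’](?P<rest>\w.*)$",
    re.IGNORECASE,
)


def split_elision(token: str) -> List[str]:
    """Split a token like "l'organization" into ["l'", "organization"].

    Tokens that do not start with a known elided prefix are returned
    unchanged, so words such as "aujourd'hui" stay whole.
    """
    match = ELISION.match(token)
    if match is None:
        return [token]
    # Keep whichever apostrophe character the source text used.
    apostrophe = token[match.end("clitic")]
    return [match.group("clitic") + apostrophe, match.group("rest")]


if __name__ == "__main__":
    for word in ["l'organization", "L’Organization", "qu'il", "aujourd'hui"]:
        print(word, "->", split_elision(word))
```

Anchoring the pattern at the start of the token is what keeps words like aujourd'hui and quelqu'un intact: the part before their apostrophe is not one of the listed prefixes, so nothing is split.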

I usually don’t need the contracted form as a whole; I’m not sure there are cases where I’d want the two words together.

Hyphens are similar: most of the time I don’t need both hyphenated words together, but sometimes I do (e.g. après-midi for afternoon).

Perhaps there could be a per-lookup option to adjust the highlight boundary? Something like “default to smallest boundary”, “capture whole word form”, etc. This is probably a naive approach, but food for thought.
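To make the boundary idea concrete, here is a rough Python sketch. Everything in it is hypothetical: the function name `lookup_query`, the `mode` values "smallest" and "whole", and the idea that the extension knows the character offset under the cursor are all assumptions for illustration, not how the extension actually works.

```python
import re

# Separators that can join two words into one token:
# straight apostrophe, curly apostrophe, hyphen.
SEPARATOR = re.compile(r"['’\-]")


def lookup_query(token: str, offset: int, mode: str = "smallest") -> str:
    """Return the search query for the character at `offset` in `token`.

    mode="whole"    -> the full token, separators included
    mode="smallest" -> only the piece under the cursor
    """
    if mode == "whole":
        return token
    if mode != "smallest":
        raise ValueError(f"unknown mode: {mode!r}")

    start = 0
    for sep in SEPARATOR.finditer(token):
        # The cursor sits before (or on) this separator,
        # so the current piece is the one being hovered.
        if offset <= sep.start():
            return token[start:sep.start()]
        start = sep.end()
    return token[start:]


if __name__ == "__main__":
    print(lookup_query("l'organization", 5))
    print(lookup_query("après-midi", 7))
    print(lookup_query("après-midi", 7, mode="whole"))
```

With "smallest" as the default, hovering organization in l'organization searches only organization, while switching to "whole" for a single lookup still lets après-midi be searched as one entry. Note that the smallest piece for an article comes back as a bare l, which would still need mapping to le or la before a dictionary lookup.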