claell opened this issue 5 years ago
@claell If you could give me some concrete pages where this happens, I'd be more than happy to look into this. It doesn't have to be immediately; when you stumble upon one, just post it here.
FYI, I've already noticed this in certain cases too (but not all the variants you mentioned), specifically when a paragraph ends with a footnote. Unfortunately, in such cases the Wikipedia API makes the mistake of swallowing the white space after the period (and the part of the API I'm using also swallows the footnote itself, which makes sense, because we don't need it in a spoken version). Anyway, coming up with a heuristic for when it is okay to fill in a pause when a dot is not followed by white space is not trivial. So that's why I haven't tackled the problem yet :-/.
I did some searching yesterday and found that those mistakes occur on the German page for "Omelett".
Example sections:
An den Singular der femininen Form kann in Österreich noch ein 'n antreten.[1]
Das Wort stammt aus dem Französischen und wurde im 18. Jahrhundert
Here the dot is read out as "Punkt" and there is no pause between "antreten." and "Das Wort". This might be caused by the Wikipedia API. To me that looks like a bug in their API; is there already a bug report for it?
Siehe auch ("See also")
Rührei
There is no pause between "Rührei" and "Soll ich noch weiterlesen?" ("Shall I continue reading?").
Okay, nice example. Have a look at:
That's the API that I'm using, and as you can see in the JSON payload when you click on the link, it says:
... ein 'n antreten.Das Wort stammt...
is there already a bug report for it?
No idea. Honestly, I just haven't had the time to look into it more closely yet. If you want to help report it, I'd very much appreciate your support.
There is no pause between "Rührei" and "Soll ich noch weiterlesen?".
That's indeed a bug in my code. Right here:
https://github.com/petergtz/alexa-wikipedia/blob/master/skill/skill.go#L258-L261
I'll see if I can fix it some time soon!
Thanks for testing with the API and the link. I think I found where to report bugs and created one: https://phabricator.wikimedia.org/T236128
So let's hope it gets fixed soon :)
Regarding your code, I am not sure how to handle this. I guess the problem is that there is no dot after "Rührei", so some logic has to be introduced to detect that first and insert a dot when needed?
Thanks for testing with the API and the link. I think I found where to report bugs and created one: https://phabricator.wikimedia.org/T236128
So let's hope it gets fixed soon :)
That's pretty cool! Thank you. I also saw that they already have a duplicate open. Unfortunately, they also say they don't plan to fix it. On the positive side, they would welcome a patch.
Regarding your code, I am not sure how to handle this. I guess the problem is that there is no dot after "Rührei", so some logic has to be introduced to detect that first and insert a dot when needed?
Yes, exactly.
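Something along these lines might do it (a rough sketch with a hypothetical helper name, not the actual code from the lines linked above):

```go
package main

import (
	"fmt"
	"strings"
)

// ensureSentenceEnd is a hypothetical helper: if the text does not already
// end with sentence punctuation, it appends a dot so the TTS engine pauses
// before the prompt that follows.
func ensureSentenceEnd(text string) string {
	t := strings.TrimRight(text, " \n")
	if t == "" || strings.ContainsAny(t[len(t)-1:], ".!?") {
		return t
	}
	return t + "."
}

func main() {
	fmt.Println(ensureSentenceEnd("Rührei") + " Soll ich noch weiterlesen?")
	// Output: Rührei. Soll ich noch weiterlesen?
}
```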
Yes, I also noticed that they linked the duplicate, and I read that they don't plan to fix it. Let's see. I assume the fix won't be too complicated, but the problem is knowing where to look for the code lines causing this. Also, the Wikipedia API seems to be pretty big feature-wise, and this seems to be only an extension which is apparently not that popular. So one thought was to maybe use a better-supported part of their API, although at a first short look I did not find one that returns the content of an article.
So one thought was to maybe use a better-supported part of their API, although at a first short look I did not find one that returns the content of an article.
Yes, that would certainly help. I spent quite some time one or two years ago finding an API that seemed to work and serve my purpose best. The problem is that all the other APIs I've found so far always return wikitext or HTML, but no plain text. And parsing wikitext or HTML and extracting just the right text is completely out of scope. It's a bit of a dilemma :-).
Hm. At least there is an API that returns wikitext or HTML. I did not find that either at my first short glance.
If one were to parse something, I think wikitext is better than HTML. It would probably have the benefit that certain things can be detected and passed to the TTS engine; for example, the quotes suggested in #35 can probably be detected this way.
So in the long run, changing to wikitext might be useful anyway. However, I know that it would require a lot of additional work to put into this project (although there might be existing parsers for it to build on), which you do in your free time. So I understand that it might just be too much to ask for.
So I understand that it might just be too much to ask for.
I'm more than happy to accept contributions though. So if you like, give it a shot.
I am just not experienced with Go at all, and also a bit time-restricted, probably the same as you. I will keep it in mind, though. What I definitely will offer is help if you decide to implement it.
Sounds good!
There is no pause between "Rührei" and "Soll ich noch weiterlesen?".
There is now :-). Please check it out.
Nice, thank you! Works for me.
Anyway, coming up with a heuristic for when it is okay to fill in a pause when a dot is not followed by white space is not trivial.
I have thought about this again today, since that would be an easier fix. I think that detecting the pattern "lowercase letter, dot, uppercase letter" should work for most cases and should not give many false positives. I thought about abbreviations like "z.B.", although those should normally be formatted with a non-breaking space in between.
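In Go, that heuristic could look roughly like this (a minimal sketch of the idea, not the skill's actual code):

```go
package main

import (
	"fmt"
	"regexp"
)

// Matches a lowercase letter, a dot, and an uppercase letter with no white
// space in between -- the pattern where the API appears to have swallowed
// the space after a sentence. (ASCII A-Z only for now.)
var missingSpace = regexp.MustCompile(`([a-z])\.([A-Z])`)

func insertSpaces(text string) string {
	return missingSpace.ReplaceAllString(text, "$1. $2")
}

func main() {
	fmt.Println(insertSpaces("ein 'n antreten.Das Wort stammt"))
	// Output: ein 'n antreten. Das Wort stammt
}
```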
I have thought about this again today, since that would be an easier fix. I think that detecting the pattern "lowercase letter, dot, uppercase letter" should work for most cases and should not give many false positives. I thought about abbreviations like "z.B.", although those should normally be formatted with a non-breaking space in between.
Agree. Good idea! One step that I want to put in between, though, is gathering data about this. We could first report all cases it would alter and let it run for a week or so. Afterwards, we could check if there are any false positives. And if there aren't, add the mechanism to insert the space.
Maybe using a GitHub issue to list all the cases would provide the necessary transparency.
Sounds good. That will avoid potentially annoying problems and also deliver some stats about it beforehand.
@claell In case you're curious, https://github.com/petergtz/alexa-wikipedia/issues/40 now contains all cases seen so far where we'd be inserting a space. In a week or so we can revisit. But it already looks quite good. No false positives so far.
Thanks for the hint! I did not know this testing phase had already been implemented. The current results indeed look pretty promising.
Well, it just went live last night. :-)
Ah, I just looked at the three-day-old comment there, but not at the edits. So you managed to implement automatic updates to this GitHub comment whenever the pattern is detected in the skill, probably after a session has ended?
Yes. And not just after a session, but on every request.
It's kind of awkward, because it doesn't always appear right away: AWS Lambda freezes the container after a response is sent to Alexa, and the update to GitHub happens asynchronously. But I wanted to avoid adding latency to the skill response because of this.
So sometimes things get written out to the GitHub comment only on the next request. But since it is not time-critical, this seemed good enough. And indeed it works.
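The pattern looks roughly like this (a hypothetical sketch; updateGithubComment stands in for the real GitHub call):

```go
package main

import (
	"fmt"
	"time"
)

// updateGithubComment stands in for the real GitHub API call (hypothetical).
func updateGithubComment(snippet string) {
	time.Sleep(200 * time.Millisecond) // simulate network latency
	fmt.Println("GitHub comment updated with:", snippet)
}

// handleRequest sketches the pattern described above: the GitHub update runs
// in a goroutine so the Alexa response is returned without added latency.
// In AWS Lambda the container is frozen right after the response is sent,
// so the goroutine may only finish during the next invocation -- which is
// why the comment sometimes updates one request late.
func handleRequest(snippet string) string {
	go updateGithubComment(snippet)
	return "response sent immediately"
}

func main() {
	fmt.Println(handleRequest("antreten.Das"))
	time.Sleep(time.Second) // outside Lambda, just let the goroutine finish
}
```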
It looks like umlauts at the end/beginning of the snippets are messing up the duplication avoidance. I will have to fix that to avoid further duplicates.
Also, the pattern currently only takes A-Z into account. I should probably change that to any letter.
And indeed it works.
Pretty cool! And there also seems to be an automatic error reporter creating GitHub issues? This looks just great! Is this from a different project that offers this, or original work? It might be helpful for other skills as well, although I am not sure how many use Go and are interested in GitHub issue tracking.
It looks like umlauts at the end/beginning of the snippets are messing up the duplication avoidance.
Nice, there is duplication avoidance; I did not know that. Is this an encoding issue with umlauts?
I should probably change that to any letter.
Like umlauts? Or other languages?
And there also seems to be an automatic error reporter creating GitHub issues?
Yes, it creates GitHub issues, but it also publishes messages on AWS SNS, which then get sent to me as emails. The emails contain the error message, a stacktrace, and the request itself. That's even more convenient than taking the query from the GitHub issue and pasting it into AWS CloudWatch. I don't put all this information into a GitHub issue, because I don't want to risk publishing data that's not supposed to be public. Sometimes, when I'm not lazy, I paste the stacktrace back into the GitHub issue for reference, but not always :-).
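Roughly sketched (assumed code, not the actual implementation; createGithubIssue and the topic ARN are placeholders), the split looks like this:

```go
package report

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sns"
)

// createGithubIssue stands in for the real GitHub API call; the public
// issue gets only the error message, nothing potentially private.
func createGithubIssue(errMsg string) { /* omitted */ }

// reportError publishes the full details via SNS, which arrives as a
// private email, while the public GitHub issue stays minimal.
func reportError(errMsg, stacktrace, request string) error {
	createGithubIssue(errMsg)

	svc := sns.New(session.Must(session.NewSession()))
	_, err := svc.Publish(&sns.PublishInput{
		TopicArn: aws.String("arn:aws:sns:eu-west-1:123456789012:errors"), // placeholder
		Subject:  aws.String("alexa-wikipedia error"),
		Message:  aws.String(errMsg + "\n\n" + stacktrace + "\n\n" + request),
	})
	return err
}
```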
Is this from a different project that offers this, or original work?
It's original work. Actually, the original work is in my alexa-journal skill, and so far I've simply copied it over. But my plan is to extract it into a separate repo so it can be re-used, just like you described. Indeed, though, I'm not sure if anyone else will use it.
Is this an encoding issue with umlauts?
Not an encoding issue; it's because I'm chopping things off after exactly 10 bytes instead of 10 runes.
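That's a classic Go pitfall; an invented example to illustrate:

```go
package main

import "fmt"

func main() {
	s := "Brotzeit Über den Tag" // "Ü" is 2 bytes in UTF-8
	// Byte-based slicing can cut a multi-byte character in half:
	fmt.Println(s[:10]) // "Brotzeit \xc3" -- an invalid half of "Ü"
	// Rune-based slicing keeps whole characters:
	fmt.Println(string([]rune(s)[:10])) // "Brotzeit Ü"
}
```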
Like umlauts? Or other languages?
Yes. And like accents and all that kind of stuff. I just realized that it's not that easy, though, because even \w doesn't cover them. Maybe it's good enough the way it is.
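For the record, Go's regexp treats \w as ASCII-only ([0-9A-Za-z_]), but its Unicode character classes do cover umlauts and accents; a quick sketch of what "any letter" could look like:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// \w in Go's regexp is ASCII-only, so it misses "ü":
	fmt.Println(regexp.MustCompile(`\w`).MatchString("ü")) // false
	// Unicode classes do match it:
	fmt.Println(regexp.MustCompile(`\p{Ll}`).MatchString("ü")) // true
	// So the heuristic could use \p{Ll} (lowercase) and \p{Lu} (uppercase):
	re := regexp.MustCompile(`(\p{Ll})\.(\p{Lu})`)
	fmt.Println(re.ReplaceAllString("antreten.Das", "$1. $2"))
	// Output: antreten. Das
}
```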
Maybe it's good enough the way it is.
I think it is. It works for most cases at least, so if much more work is required to cover edge cases, it is probably not worth it currently, at least unless somebody complains about it.
It's getting interesting: I found two false positives: German "e.V." and English "Ph.D.". Both get read incorrectly by Alexa when a space is inserted (she pauses in between). Let's wait a few more days; maybe we'll find more. (Let's still implement the algorithm as you suggested. I think it's a great heuristic. We just need to special-case our findings.)
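One way to special-case them (a sketch under my own assumptions; the exception list and the dot-masking trick are mine, not the deployed code):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var missingSpace = regexp.MustCompile(`([a-z])\.([A-Z])`)

// False positives collected during the observation phase (hypothetical list).
var exceptions = []string{"e.V.", "Ph.D."}

func insertSpaces(text string) string {
	// Temporarily mask the dots of known abbreviations so the pattern
	// cannot match inside them, then restore the dots afterwards.
	for _, e := range exceptions {
		masked := strings.ReplaceAll(e, ".", "\x00")
		text = strings.ReplaceAll(text, e, masked)
	}
	text = missingSpace.ReplaceAllString(text, "$1. $2")
	return strings.ReplaceAll(text, "\x00", ".")
}

func main() {
	fmt.Println(insertSpaces("Der Verein e.V. besteht weiter.Neuer Satz"))
	// Output: Der Verein e.V. besteht weiter. Neuer Satz
	// "e.V." stays intact; the real sentence boundary still gets its space.
}
```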
I also saw the "e.V.". I thought it would normally be formatted with a non-breaking space in between, so I assumed it would be no problem for Alexa. On Wikipedia it is written with a space in between on the "Verein" page: https://de.wikipedia.org/wiki/Verein#Eingetragener_Verein
So I assume for this example the Alexa TTS readout is just wrong. It might be interesting to investigate whether the Wikipedia API also swallows non-breaking spaces, or whether the occurrence was simply written without a space in the Wikipedia article.
However, such non-breaking spaces don't seem to be used for English abbreviations: https://en.wikipedia.org/wiki/Non-breaking_space#Width_variation
So "Ph.D." really should not contain such a space.
Two more: "G.m.b.H." and "Holding S.p.A. übernom"
Interestingly, I also saw: "für 1 Mrd.US-Dollar A". There should normally be a space there in the Wikipedia article. Maybe a swallowed non-breaking space.
Also, the duplicate detection and the handling of umlauts seem to be gone again? At least when I scroll down there are duplicates again, and also some in English.
the duplicate detection and the handling of umlauts seem to be gone again
No, I never had the time to fix it.
Sorry, I probably misinterpreted something. I thought it had been fixed already, probably with the commit you linked. I know that one mostly doesn't have the time one would like for such side projects, so no pressure, and I don't expect anything. Your current rate of response to issues etc. is already way higher than in most projects on GitHub I have experienced so far.
I noticed that sometimes there is no pause when a sentence ends; instead it flows directly into the next sentence. That sometimes also happened with the "Soll ich weiterlesen?" prompt.
Also, I noticed that dots at the end of a sentence are sometimes read out as "Punkt".