viresh-ratnakar / lufz

MIT License

WikiExtractor problem #1

Closed: iwmo closed this issue 1 year ago

iwmo commented 1 year ago

Hi Viresh,

First of all, thanks for the amazing tools. I'm trying to create a lexicon.js based on Portuguese words, but I'm struggling to run the commands mentioned. I have the word list and the wiki dump. You mentioned "This will write many files named text/??/wiki_??", but I only get one file with all the words after running WikiExtractor.py.

I'm then unable to run the next command.

Any idea what is going on? Also, do you think this will work? I'm trying to help my father have a Portuguese version of your crosswords solution. :)

Thanks

viresh-ratnakar commented 1 year ago

That's great that you're trying to create a Portuguese version of Exet! Curiously enough, I'm at the moment slowly working towards adding Hindi support. Some of the simple first steps for that include splitting up exet.html into exet.{css,js,html}, as I imagine then we can simply create small language-specific variants of exet.html, such as exet-hindi.html and exet-portuguese.html. This refactoring should be ready in a week or so.

Does Portuguese have compound characters (more than one unicode making up a single character), or is it pretty much like English?

I haven't run the lufz code in quite a while, and it's possible that it needs to be updated. It's also possible that it needs some tweak specific to Portuguese. You say you get only one file: maybe the pt data is much smaller. How big is this file? Can you please send me a copy? If it is too big, perhaps you can send me the first 10k lines from it? I would love to help with this project.


iwmo commented 1 year ago

Just trying to help my father and some other people who have been crossword hobbyists for many years. I'm also trying to learn a bit more about Python and JS, so I have some fun trying to help them.

Portuguese has some diacritics, like in my name, João. There are also others like é, à, á, and a few more.
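To make the "compound characters" question concrete: in Unicode, an accented letter like the ã in João can be stored either as one precomposed code point or as a base letter plus a combining mark, so a lexicon pipeline should normalize words before comparing them. A small Python sketch:

```python
import unicodedata

word = "João"
nfc = unicodedata.normalize("NFC", word)  # precomposed: "ã" is one code point
nfd = unicodedata.normalize("NFD", word)  # decomposed: "a" + combining tilde
print(len(nfc), len(nfd))  # 4 5
```

The two strings compare unequal byte-for-byte even though they display identically, which is why picking one normalization form (usually NFC) up front avoids subtle lexicon mismatches.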

I'm attaching a sample of my word list and also the output of WikiExtractor.

Let me know if I can help in any other way. worlist.txt

sample wiki.txt

viresh-ratnakar commented 1 year ago

Thanks. Looks like they changed the options/format of WikiExtractor a bit. Looking into the needed changes.

Where did you get the Portuguese word list? Is it open-source data that anyone can use? If so, can you please point me to the source?

iwmo commented 1 year ago

I found it at a Portuguese university. The source is "Index of /download/sources/Dictionaries" at di.uminho.pt; they have a few more lists and resources. Later I will probably try to compile my own version, especially because Brazilian Portuguese has some additional words and also many users. I think it's open source, but once I compile my own I will definitely share it with you.


iwmo commented 1 year ago

Hi Viresh,

Quick one: for this purpose, is it better to have a compiled word list or to work on dictionaries? Just asking so that I can start working on those.

Thanks

viresh-ratnakar commented 1 year ago

My README.md was pointing to an obsolete/different version of WikiExtractor.py. I have now updated it. Please use the updated instructions and you should be able to generate the popularity signal for Portuguese.

Not sure what you mean by "is it better to have wordlist compiled or to work on dictionaries." If you can generate/find a good, open-source Portuguese word list, imo that will be very useful. It will be easy for me to create a Portuguese version of Exet from the word list (after the refactoring that I'm currently working on). Another useful thing would be to get phonemes for some or all of the words (for English, I use CMUdict). But that's not essential.

iwmo commented 1 year ago

I'm working on getting a good word list. I'll share it once all the cleaning is done. Meanwhile, I tried to use the extractor with the new instructions, but I'm still getting errors and am unable to finish.

```
$ python -m wikiextractor.WikiExtractor ptwiki-latest-pages-articles.xml.bz2
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/oriebirj/Desktop/lufz/wikiextractor/WikiExtractor.py", line 66, in <module>
    from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces
  File "/home/oriebirj/Desktop/lufz/wikiextractor/extract.py", line 382, in <module>
    ExtLinkBracketedRegex = re.compile(
  File "/usr/lib/python3.11/re/__init__.py", line 227, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.11/re/__init__.py", line 294, in _compile
    p = _compiler.compile(pattern, flags)
  File "/usr/lib/python3.11/re/_compiler.py", line 743, in compile
    p = _parser.parse(p, flags)
  File "/usr/lib/python3.11/re/_parser.py", line 980, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.11/re/_parser.py", line 455, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.11/re/_parser.py", line 863, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.11/re/_parser.py", line 455, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.11/re/_parser.py", line 863, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.11/re/_parser.py", line 455, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.11/re/_parser.py", line 841, in _parse
    raise source.error('global flags not at the start '
re.error: global flags not at the start of the expression at position 4
```
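For what it's worth, that error message points at a Python 3.11 change: the re module now rejects inline global flags such as (?i) when they appear anywhere but the start of a pattern (older versions only warned). If that is indeed the cause, the affected regex in extract.py would need its flag moved to the front or scoped to a group. A minimal illustration of the portable forms:

```python
import re

# In Python 3.11+, this raises re.error ("global flags not at the
# start of the expression"); before 3.11 it was only a warning:
#   re.compile(r"abc(?i)def")

# Portable alternatives: flag at the start, or scoped to a group.
p1 = re.compile(r"(?i)abcdef")    # whole pattern case-insensitive
p2 = re.compile(r"abc(?i:def)")   # only "def" case-insensitive
print(bool(p1.search("ABCDEF")), bool(p2.search("abcDEF")))  # True True
```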

Any idea what it could be?

I'm also looking for the phonemes, but it's a bit more challenging.

:)

viresh-ratnakar commented 1 year ago

Sorry, no idea what those error messages are. However, I did download the pt Wikipedia myself and was able to run WikiExtractor.py. I don't think it produced these error messages for me, or maybe it did; I do not quite remember. It did generate about 70 MB of text, which is good enough.

Once you have a word list, I can run the remaining steps. Of course, if you have the same 70 MB extracted, you can try too.

iwmo commented 1 year ago

I wish I had, but I'll keep trying to sort out the issue. Meanwhile, I'm sharing a list with you: lista palavras.txt

Also found this website: http://www.portaldalinguaportuguesa.org/index.php?action=fonetica&act=list

They have a list of phonemes. I tried to scrape it, but the format of the HTML beat me up. :)
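If it helps, table-like pages can often be scraped with just the standard library's html.parser, collecting the text of each <td> cell row by row. This is a generic sketch only (the real page's markup may differ), fed here with a made-up sample row:

```python
from html.parser import HTMLParser

class TableTextParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows of cell text
        self._row = []       # cells of the row being parsed
        self._cell = []      # text fragments of the cell being parsed
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td:
            self._cell.append(data)

# Hypothetical sample; the actual site's HTML will differ.
parser = TableTextParser()
parser.feed("<table><tr><td>João</td><td>ʒuˈɐ̃w̃</td></tr></table>")
print(parser.rows)  # [['João', 'ʒuˈɐ̃w̃']]
```

In practice you would feed it the downloaded page text and then filter the rows for word/phoneme pairs.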

viresh-ratnakar commented 1 year ago

Thanks! I need to make a few tweaks to the Lufz code, but should be able to do it soon. Let's do phonemes as a follow-up.

I do have some questions/requests for you.

  1. If the word list that you generated has been taken from a single source, with very little modification, please let me know what the source is, so that I can credit them and include any license notices that they may have. If (and I think this might be the case) you have combined one or more sources and then edited the list, I would recommend that you make a GitHub repository for it, with a license notice (hopefully the MIT License) and a README.md file that lists all the sources you used along with any license notices they may have. Then, in Exet/Lufz, I can simply point to your repository as the source (and if any of your sources require their license notice to be included in any application, I will include those as well).

  2. For Portuguese crosswords, based upon this Wiki article:

    • In Portuguese, diacritics are ignored with the exception of Ç. Therefore, A could be checked with à or Á.

    So, I am thinking of normalizing all letters to [A-Z], except for Ç. Do you think this is the right thing to do? How do the newspaper crosswords handle diacritics?

iwmo commented 1 year ago

1- The list of words I supplied was from: https://github.com/fserb/pt-br/blob/master/palavras. I also found another repo with an extensive list: https://github.com/AlfredoFilho/Palavras_PT-BR

I also created a list of phonetics in my repo: https://github.com/iwmo/dicionario-fonetico

2- After checking with my father, the list of rules is a bit more extensive, at least the one used by Portuguese/Brazilian word-puzzle enthusiasts (charadistas). They are into puzzles involving words, not exclusively crosswords. In this case they drop all diacritics other than Ç and Ã. Also, words with "-" can be used if both crossing entries use it. There is also a limit on the number of black squares, depending on the size of the grid. If you're interested, I will gather more info you may find relevant.
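That normalization convention could be sketched like this (a sketch only, using a KEEP set for the charadista rule of preserving Ç and Ã while stripping all other diacritics):

```python
import unicodedata

KEEP = {"Ç", "Ã"}  # kept as distinct letters per the rule above

def normalize_entry(word: str) -> str:
    """Uppercase a word and strip diacritics, except those in KEEP."""
    out = []
    for ch in unicodedata.normalize("NFC", word.upper()):
        if ch in KEEP or ch.isascii():
            out.append(ch)
        else:
            # Decompose, then drop the combining marks (the accents).
            out.append("".join(c for c in unicodedata.normalize("NFD", ch)
                               if not unicodedata.combining(c)))
    return "".join(out)

print(normalize_entry("coração"), normalize_entry("pé"))  # CORAÇÃO PE
```

Hyphens pass through unchanged here, so the "both crossing entries use it" rule would still need to be enforced separately at grid-fill time.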

Thanks

viresh-ratnakar commented 1 year ago

Quick update: I am working on this, whenever I find time :-). I expect to have something in a week or so.

iwmo commented 1 year ago

That's awesome. Very excited to see the results. Let me know in case I can help with anything.


viresh-ratnakar commented 1 year ago

I've finally created this. Please give it a go at:

https://viresh-ratnakar.github.io/exet-brazilian.html

This being the very first version, there are bound to be a lot of rough edges. Please feel free to file bugs and feature requests at https://github.com/viresh-ratnakar/exet/issues