Open ricardojosehlima opened 2 years ago
@ricardojosehlima
There are variables in grammar.xml with the words:
<!ENTITY barbarismos "(?:s(?:t(?:r(?:ip(?:tease[rs]?|per|s)?|ess.*|ogonoff|aight)|a(?:nd(?:ard|s)?|ccato|rtup|ff)|o(?:ryboard|kes|ck|p)|i(?:ck|lb)|e(?:nt|p))|p(?:r(?:int(?:er|s)?|ead|ay|ue)|in(?:naker|s)?|a[ms]?|ot)|u(?:b(?:-holding|woofer)|s(?:pense|hi)|perstar|doku|rf)|h(?:ar(?:eware|-pei)|o(?:[tw]|pping)|iatsu|ekel|unt)|a(?:(?:xhor|loo)n|mple[rs]?|shimi|vart)|c(?:herz(?:and)?o|rewball|o[np]e|at)|n(?:o(?:wboard|oker|b)|ack(?:-bar)?)|o(?:(?:cialit|ttovoc|ftwar)e|ul)|m(?:o(?:(?:kin)?g|rzando)|ash)|l(?:i(?:ck|de|p)|alom|ogan)|e(?:t(?:-point|ter)?|xy)|w(?:eat(?:shirt|er)|ing)|i(?:decar|evert|ngle|c)|fo(?:rzand|gat)o|k(?:ipper|ate)|quash)|c(?:a(?:n(?:ta(?:bile|ta)|yoning|iche)|r(?:diofitness|paccio|ré)|(?:che-sex|lzon)e|m(?:p(?:us)?|ber)|t(?:ering|gut)|sh-and-carry)|o(?:r(?:don-bleu|pus)|u(?:lomb|ntry)|(?:spla|wbo)y|(?:ck|v)er|loratura|ntinuum|okie)|h(?:a(?:r(?:leston|treuse)|ise-longue)|e(?:ong-sam|ck-in|ddar)|utney)|l(?:ear(?:ance|s)?|u(?:ster|b)|ipboard|oisonné|arke)|r(?:o(?:issan|que)t|evasse)|zar(?:s?ti|ina)|(?:élad|ân)on|u(?:mbia|p)|ódex|ent)|p(?:a(?:nach(?:és?|es?)|t(?:chouli|hos)|intball|parazzi|hoehoe|lmier|anga|rsec|ssim)|i(?:n(?:s(?:cher)?|ce-nez|g-pong|-?up)?|(?:anofort|ckl)e|ercing|dgin|lé)|o(?:st(?:er(?:iori|s)|-it)|(?:odl|ch|is)e|ltergeist|t-?pourri|grom|p)|e(?:(?:tit-suiss|nc)e|restroika|corino)|r(?:i(?:ori|se)|omenade|essing|áxis)|lay(?:b(?:ack|oy)|station|maker)|hot(?:o(?:-finish|maton)|s)?|u(?:tter|zzle|nk|b))|t(?:a(?:(?:sk-forc|wni)e|l(?:k-show|iban)|ke(?:away|s)?|n(?:dem|k)|volatura|ekwondo|i-chi|blet)|u(?:t(?:(?:ilimúnd|t)i|u)|rbo-diesel|pperware|grik)|r(?:i(?:p(?:-hop|lex)|al)|a(?:velling|iler)|emolo)|i(?:me(?:-sharing|share)|e-break|ramisu|cket)|e(?:le(?:marketing|x)|(?:rylen)?e|chno|flon)|o(?:p(?:(?:les)?s)?|ner|fu|ri)|h(?:esaur(?:us|i)|ink)|(?:w(?:ee|is)|-shir)t|sunami|ópos)|b(?:a(?:by-(?:sitter|doll|grow)|(?:varois|din)e|ckup|ht|rn)|r(?:e(?:akdance|nt)|u(?:shing|nch)|ainstorming|ie)|e(?:ta-tester|nedictus|cquerel|agle|bop)|o(?:bsleigh|dyboard
|nsai|xers|ate|rt)|i(?:t(?:map|s)?|odesign|g-bang|p)|u(?:ngee-jumping)|l(?:ister|ague|ues|og)|yte)|k(?:i(?:l(?:o(?:(?:vol|wat)t|b(?:yte|it))|im|t)|t(?:chenette|s(?:ch)?)?|ckbox(?:ing|er)|n[ag]|butz|p)|a(?:r(?:a(?:oke|té)|bovanet|ting)|lash(?:nikov|s)?|mikaze|sba)|r(?:(?:ípto|emli)n|aft|ill)|wa(?:(?:ch|nz)a|shiorkor)|e(?:f(?:fieh|ir)|tchup)|un(?:g-fu|a)|yat)|r(?:o(?:c(?:k(?:er|s)?|aille)|(?:entge|ll-o)n|ttweiler|quefort|aming|okie|deo)|e(?:a(?:lpolitik|dy-made)|p(?:rise|s)|dneck|ggae|m)|i(?:n(?:forzando|ggit|k)|ckettsia|tardando|ff)|a(?:l(?:lentando|enti)|p(?:per|s)?|fting|ve)|é(?:veillon|gie|tro)|öntgen|ubato)|m(?:a(?:t(?:ch-point|rioska)|r(?:keting|chand)|(?:estos|mb)o|(?:st?e|yo)r|na(?:ger|t)|gnificat|jorette|xwell|quis)|i(?:l(?:curie|ady)|s(?:erere|ter)|ndfulness|crofarad)|o(?:d(?:e(?:rato|m)|us)|hair)|e(?:morandum|dley)|u(?:sic-hall|esli)|vdol)|f(?:o(?:[bg]|x(?:-terrier|trot)?|r(?:tran|int)|ndue|yer|lk)|a(?:i(?:t-divers|r-play)|rad(?:ay|s)?|nfreluche|twa|x)|l(?:a(?:sh(?:es)?|menco|t)|i(?:p-flop|nt)|ute)|u(?:n(?:board|ky?)|gato|ton)|r(?:anchising|eeware)|ermata|itness|öhn)|a(?:n(?:ti(?:-(?:establishment|apartheid|dumping)|trust)|gström)|p(?:felstrudel|paratchik|artheid|lomb)|l(?:legr(?:ett)?o|zheimer|ibi)|c(?:celerandos?|id-jazz)|uto(?:pullman|cross)|git(?:-?prop|ato)|yurveda|mabile|irbus)|d(?:o(?:p(?:ing|pler)|wnhill|car|jo|ng)|r(?:ive(?:[rs]|-in)?|ugstore)|e(?:sign(?:er)?|ficit|bye)|i(?:s(?:cman|eur)|rham)|u(?:mping|plex|ce)|a(?:nzón|tcha))|g(?:i(?:ga(?:byte|watt)|rlsband|lbert|nseng)|o(?:(?:odwil|spe)l|belet|uda)|l(?:a(?:snost|mour)|ide)|u(?:aracha|lag)|ru(?:yèr|ng)e|a[ly]|eyser)|h(?:a(?:b(?:it(?:at|us)|anera)|(?:m-ioc-chon|shta)g|rd-rock)|i(?:p(?:p(?:ie|y)|-hop)|drospeed)|o(?:rseball|lding|mo)|eavy-metal|usky|ype)|v(?:i(?:de(?:ocl(?:ip(?:es)?|ub)|s)?|(?:t(?:rin|a)|ntag|vac)e|brato)|e(?:r(?:nissage|sus)|lcro|gan)|o(?:lt(?:e-face|s)?|yeur)|audeville)|l(?:e(?:[dk]|a(?:sing|d)|itmotiv|gato)|o(?:c(?:kout|us)|oping|ess|gin)|a(?:rghetto|ser|mé|ts)|i(?:ngerie|fting))
|o(?:ff(?:s(?:hore|et)|ice-boy|line)|s(?:tpolitik|sobuco)|u(?:tsider|guiya)|verbooking|n-?line|ersted|rigami|pus)|j(?:a(?:m(?:-session|boree)|c(?:kpot|uzzi)|zz)|o(?:int-ventur|ul)e|u(?:kebox|nkie)|et-(?:lag|set)|iu-jitsu)|w(?:a(?:(?:lkie-talki|ffl)e|rrant|sp|d)|e(?:b(?:er|s)?|stern)|i(?:ld-card|ndsurf)|o(?:rkshop|n)|hist)|i(?:n(?:ter(?:f(?:eron|ace)|net)|f(?:otainment|luenza)|-octavo|s)|(?:bid|t)em|mpedimenta|ppon|d)|n(?:e(?:(?:cessair|w-ag)e|(?:tspli)?t)|o(?:menklatura|ir)|apalm|uance|ylon)|e(?:n(?:s(?:alada|emble)|tente)|r(?:satz|g)|mmenthal|cstasy|vasé|dam)|qu(?:a(?:lifying|ntum|rk)|i(?:lohertz|che))|z(?:e(?:itgeist|kel|n)|apping)|y(?:u(?:ppie|an)|ang|eti|in)|u(?:ndergroun|ploa)d)">
<!ENTITY barbarismos2 "b(?:irdwatching|lockchains?|odyboarders?)|backdoors?|bots?|c(?:hipset|rowdfunding)s?|desktops?|DNA|dominatrix(?:es)?|draft|geocach(?:ing|ers?)|h(?:atchback|ijab|otspot|overboard)s?|icebergs?|jetpacks?|k(?:ernels?|evlar)|m(?:alware|illennial)s?|n(?:etworking|otch|uggets?)|overclock(?:ings?)|p(?:arkour|hishing|odcast|unchline)s?|RNA|s(?:martwatch(?:es)|ext(?:ing|ortion)|tormtroppers?|treaming)|trackpads?|w(?:ebsite|halewatching|oks?)">
<!ENTITY barbarismos3 "abstracts?|applets?|apps?|backbones?|baconburgers?|banners?|bitcoins?|bits?|blue|bluetooth|bogie|boids?|boost|bottom[-]up|brainstorms?|bullying|burnouts?|cameraman|carjacking|chairman|contactless|crackers?|cracking|crashes|cyberbullying|czar|czares|debugging|demand|developers?|dildos?|downgrades?|downtime|drones?|e[-]bullying|e[-]commerce|e[-]learning|e[-]manuals?|e[-]newsletters?|e[-]readers?|exabits?|exabytes?|fair[-]play|faxes|feeders?|flats?|flyers?|forward|führers?|gaming|gangbang|geeks?|gigabits?|gigabytes?|goals?|gold|hackers?|homebanking|homepages?|idem|jiu[-]jitsu|kilobits?|kilobytes?|know[-]how|KO|layers?|lob|loops?|lossless|marketeers?|maydays?|megabits?|megabytes?|milkshakes?|nerds?|netiquette|newsgroups?|off[-]label|offside|on[-]label|overflows?|p[-]values?|papers?|patch|patches|pens?|petabits?|petabytes?|players?|preview|ransomware|reboot|red|reggaeton|restart|roadmaps?|royalty|royalties|scams?|scammers?|screenshots?|scroll|seals?|shell|shelters?|smartphones?|sockets?|stalking|stealth|strings?|swarms?|tablets?|tagging|takeovers?|terabits?|terabytes?|thresholds?|tips?|toolkits?|tracking|triggers?|trolls?|tweeters?|underscores?|Unicode|updates?|upgrades?|uptime|Usenet|wargames?|warm[-]up|webinar|webinares|webmasters?|webpages?|wireless|woofers?|yes[-]man|yottabits?|yottabytes?|zettabits?|zettabytes?">
The easiest way is to find which words aren't foreign in pt-BR; I will add them to an exception token and then create a rule in the pt-PT folder's grammar.xml with these words.
Basically, the idea is to create a rule with just the words that were added as exceptions in the main grammar.xml.
Then, duplicate the original rule and make it accept only pt-PT.
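A minimal Python sketch of how the two rules described above would combine. It only models the logic, not LanguageTool itself: the word lists are short excerpts, and the names `BARBARISMS`, `PT_BR_ACCEPTED`, `main_rule_matches`, `pt_pt_rule_matches` and `is_flagged` are hypothetical.

```python
import re

# Short excerpt modelled on the barbarismos2 entity above, not the full list.
BARBARISMS = re.compile(r"blockchains?|podcasts?|streaming|websites?")

# Hypothetical exception list: words assumed to be accepted in pt-BR.
PT_BR_ACCEPTED = re.compile(r"blockchains?|streaming")


def main_rule_matches(word: str) -> bool:
    """Shared rule: a barbarism that is not covered by the exception."""
    return bool(BARBARISMS.fullmatch(word)) and not PT_BR_ACCEPTED.fullmatch(word)


def pt_pt_rule_matches(word: str) -> bool:
    """pt-PT-only rule: exactly the words that were moved into the exception."""
    return bool(PT_BR_ACCEPTED.fullmatch(word))


def is_flagged(word: str, variant: str) -> bool:
    """Combine both rules: the shared one always runs, the second only for pt-PT."""
    if main_rule_matches(word):
        return True
    return variant == "pt-PT" and pt_pt_rule_matches(word)
```

With these sample lists, "streaming" would be flagged for pt-PT but not for pt-BR, while "podcast" would still be flagged for both.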
ahhhhh... I am unable to write properly 🙂
@marcoagpinto Great! I will try to work on it asap!
@marcoagpinto There are two files attached here:
barbarisms-pt-BR_email.txt is based on barbarisms-pt-BR.txt; the only change is to exclude the suggestions at the end of the file that replace "email" and its variants with translations. The change is marked with a * before the words.
barbarisms-pt_nao_BR.txt is based on barbarisms-pt.txt, the file that applies to all variants of Portuguese. The words to be selected for the exception you mentioned are marked with a * before them.
Now, for the ENTITY barbarismos 1, 2 and 3 in grammar.xml, I recognize many words that, in the Brazilian standard register, no one cares whether they are in quotation marks or not: suspense, slide, streaming, blockchain, bitcoin. I wonder whether any of those would need quotation marks, so my suggestion is to create some sort of exception for the Brazilian variant so that these 3 rules do not apply. Attachments: barbarisms-pt-BR_email.txt, barbarisms-pt_nao_br.txt
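The `*` marking used in the attached files could be read back with a small helper like the one below. This is only a sketch: the exact layout of the attached files is not shown in this thread, so it assumes one entry per line with the `*` at the start, and `starred_entries` is a hypothetical name.

```python
def starred_entries(lines):
    """Return the entries marked with a leading '*', with the marker removed."""
    entries = []
    for line in lines:
        stripped = line.strip()
        # Assumption: a marked entry starts with '*' (optionally followed by spaces).
        if stripped.startswith("*"):
            entries.append(stripped[1:].strip())
    return entries
```

For a made-up input such as `["*suspense", "stress", "* slide"]` this returns `["suspense", "slide"]`, which is the list that would go into the exception.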
@ricardojosehlima
I will look at it at 5am 🙂
I promise 🙂
right now I can't focus
@ricardojosehlima
Hello!!!!
I have created a variable for specific pt-BR barbarisms in the main grammar.xml to make it easier to add specific words in the future.
I noticed that the rule already existed (created by me) in pt-PT, so I just added your words to the list.
https://github.com/languagetool-org/languagetool/commit/9f18075d52c1fad03f1646680aded7da46f0e132
https://github.com/languagetool-org/languagetool/commit/ec6d2827d6ed89b2e91606b6f9c08e38ccf4fa56
@ricardojosehlima
Regarding the .txt files you attached above, could you please commit them or make PRs so that I just need to press one button for them to go live?
This way there won't be any risks of me messing up.
Thanks!
Hi @marcoagpinto, I made the PR for barbarisms-pt-BR, which removes the e-mail suggestions. However, the other file, barbarisms-pt, applies to both variants, and if I remove the suggestions from it (there are 163), I think it will affect the variants other than Brazilian, am I right? So, how should I proceed?
Well, just send the full files to me via e-mail, and I will commit them.
This way success = 100%
@ricardojosehlima
It is done!
https://github.com/languagetool-org/languagetool/commit/bf614bbec164c1e9ef4329d99e98a9ee88521b78
Thanks!
I didn't see the PR, I have been processing video and audio.
Hi @marcoagpinto, the word 'item' is still being flagged in pt-BR as foreign, can you check what is going on?
@ricardojosehlima
Sure, it should be very easy to fix 😃
I will fix it at 5am…
I don't know if you have been following the conversation regarding the disambiguator, but now more verbs which were detected as nouns are truly detected as verbs.
I need to grab the latest nightly (wikipedia + standalone tool) and unzip them to my desktop so that the rules don't give errors (since the grammar.xml file in the repository had changes in the examples and if I try with my version, TESTRULES PT will throw errors).
Yes, I am a lazy arse… I will wait for 5am when my brain is fresh…
😄
I will then reply here when it is done.
Luckily, the official release is delayed by a few days, so I guess we can still make this fix in the official release.
Ok! Yes, I am following the disambiguator work a little; it's necessary and good! I'm still dealing with the beginning of classes at the university where I work, so these last days (and probably some more to come) I've had less time to dedicate to LT issues.
I have been writing down tons of improvements to be done for the release in three months.
I will need your help and advice for many of them since I want it to be top.
Also, in the upcoming months I want to fix the verb suggestions (antipatterns) and will need your help, since I am confused in some cases by uses such as: "Ela não me deu a chave" and "Ela não deu-me a chave"… I need to be 100% sure of such things and of the proper exceptions before I code antipatterns… I will have to revise all the antipatterns.
Also, the comma rules… my mother went to the trouble of printing documents that explain all the comma rules, but I have been a lazy arse to read them… I must find the inner strength to do it.
The good news is that in September or October the doctor will officially request my retirement, and if all goes well I should be retired some months later (I hope it won't take a year or two); then I will have more time for LanguageTool and other projects.
@ricardojosehlima
As promised, here is the fix: https://github.com/languagetool-org/languagetool/commit/889bad0c5abe6631eb2547dd711291083309e362
https://github.com/languagetool-org/languagetool/commit/6a54166be2d364e4754315940d11bf6d451d5d33
@ricardojosehlima
Hello!
Are the words “input” and “output” foreign in PT-Brazilian?
I forgot to add them as foreign words.
I only noticed today.
No, they aren't.
I have been writing down tons of improvements to be done for the release in three months.
I will need your help and advice for many of them since I want it to be top.
Also, in the upcoming months I want to fix the verb suggestions (antipatterns) and will need your help, since I am confused in some cases by uses such as: "Ela não me deu a chave" and "Ela não deu-me a chave"… I need to be 100% sure of such things and of the proper exceptions before I code antipatterns… I will have to revise all the antipatterns.
Also, the comma rules… my mother went to the trouble of printing documents that explain all the comma rules, but I have been a lazy arse to read them… I must find the inner strength to do it.
Ok, count me in!
@ricardojosehlima
Look at this: https://forum.languagetool.org/t/reminder-upcoming-feature-freeze-for-languagetool-5-8/8024/4
The official release is delayed until next week.
I will try to fix some more gender and number agreements during the weekend.
The more that gets fixed, the better.
😄
Brazilian Portuguese is known for being more flexible with the use of foreign words than European Portuguese. Words like layout, e-mail, pizza, item are accepted even in more formal registers.
Recently, I have made some changes in the barbarisms-pt-BR.txt file, but I have noticed that there is a barbarisms-pt.txt file that seems to apply to all variants of Portuguese. In this case, how should I proceed?
My take would be (I) to move the words and expressions that must not be replaced in Brazilian Portuguese to the barbarisms-pt-PT.txt file and leave them there.
Also, there are some words that LT flags in a different manner.
For example, 'layout' is in that barbarisms-pt-PT.txt file and the message is "layout é um estrangeirismo. É preferível dizer disposição" ("layout is a foreign word. It is preferable to say disposição").
But for 'item' the message is "Os estrangeirismos devem estar entre aspas ou ser italizados" ("Foreign words must be in quotation marks or italicized"). For cases like this, I couldn't find (II) where LT gets this from: whether it is in another file or in grammar.xml.
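The two messages suggest two different sources, which the following Python sketch tries to illustrate. It rests on assumptions: that the barbarisms .txt files hold one `word=suggestion` pair per line with `#` comments and `|` between alternatives (the file format is not shown in this thread), and that the helper names are hypothetical. The regex is an abbreviated excerpt of the `i(?:…)` branch of the barbarismos entity quoted in this thread, which contains `(?:bid|t)em`.

```python
import re

# Abbreviated excerpt of the i(?:...) branch of the barbarismos entity.
ENTITY_EXCERPT = re.compile(r"i(?:(?:bid|t)em|mpedimenta|ppon|d)")


def load_replacements(lines):
    """Parse 'word=suggestion|suggestion' lines (assumed file format)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        word, sep, suggestions = line.partition("=")
        if sep:
            table[word.strip()] = [s.strip() for s in suggestions.split("|") if s.strip()]
    return table


def message_for(word, table):
    """Return the message a word would get under this model, or None."""
    if word in table and table[word]:
        # Wording taken from the 'layout' message quoted above.
        return f"{word} é um estrangeirismo. É preferível dizer {table[word][0]}"
    if ENTITY_EXCERPT.fullmatch(word):
        # Wording taken from the 'item' message quoted above.
        return "Os estrangeirismos devem estar entre aspas ou ser italizados"
    return None
```

Under this model 'layout' gets its message from the replacement list, while 'item' matches the entity regex, which would point to grammar.xml as the place to add a pt-BR exception.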