languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[pt] On foreign words in Brazilian Portuguese #6700

Open ricardojosehlima opened 2 years ago

ricardojosehlima commented 2 years ago

Brazilian Portuguese is known for being more flexible with use of foreign words than European Portuguese. Words like layout, e-mail, pizza, item are accepted even in more formal registers.

Recently, I made some changes to the barbarisms-pt-BR.txt file, but I noticed that there is also a barbarisms-pt.txt file that seems to apply to all variants of Portuguese. In that case, how should I proceed?

My take would be (I) to move the words and expressions that must not be replaced in Brazilian Portuguese to the barbarisms-pt-PT.txt file and leave them there.

Also, some words are flagged by LT in a different way.

For example, 'layout' is in that barbarisms-pt-PT.txt file and the message is "layout é um estrangeirismo. É preferível dizer disposição" ("layout is a foreign word. It is preferable to say disposição").

But for 'item' the message is "Os estrangeirismos devem estar entre aspas ou ser italizados" ("Foreign words must be in quotation marks or italicized"). For cases like this, (II) I couldn't find where LT gets this from: whether it is in another file or in grammar.xml.
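To make the difference between the two messages concrete, here is a minimal Python sketch of the file-based path only. It assumes the barbarisms files hold one `word=suggestion` entry per line, with `|` separating alternatives and `#` starting a comment. That format and the function names are illustrative assumptions, not taken from LT's source. Only the `layout` entry comes from this thread.

```python
# Hypothetical sample in an assumed "word=suggestion1|suggestion2" format.
SAMPLE = """\
# comment lines are skipped
layout=disposição
"""

def parse_replacements(text):
    """Map each listed word to its list of suggested replacements."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        word, _, suggestions = line.partition("=")
        table[word.strip().lower()] = [s.strip() for s in suggestions.split("|")]
    return table

def file_based_message(word, table):
    """Return the specific message for a listed word, or None if it is not listed."""
    suggestions = table.get(word.lower())
    if suggestions is None:
        return None
    return f"{word} é um estrangeirismo. É preferível dizer {', '.join(suggestions)}"

table = parse_replacements(SAMPLE)
print(file_based_message("layout", table))
print(file_based_message("item", table))
```

In this model a word that is absent from the file, such as 'item', produces no file-based message at all, which would be consistent with its generic message coming from a different place.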

marcoagpinto commented 2 years ago

@ricardojosehlima

There are variables in grammar.xml with the words:

    <!ENTITY barbarismos "(?:s(?:t(?:r(?:ip(?:tease[rs]?|per|s)?|ess.*|ogonoff|aight)|a(?:nd(?:ard|s)?|ccato|rtup|ff)|o(?:ryboard|kes|ck|p)|i(?:ck|lb)|e(?:nt|p))|p(?:r(?:int(?:er|s)?|ead|ay|ue)|in(?:naker|s)?|a[ms]?|ot)|u(?:b(?:-holding|woofer)|s(?:pense|hi)|perstar|doku|rf)|h(?:ar(?:eware|-pei)|o(?:[tw]|pping)|iatsu|ekel|unt)|a(?:(?:xhor|loo)n|mple[rs]?|shimi|vart)|c(?:herz(?:and)?o|rewball|o[np]e|at)|n(?:o(?:wboard|oker|b)|ack(?:-bar)?)|o(?:(?:cialit|ttovoc|ftwar)e|ul)|m(?:o(?:(?:kin)?g|rzando)|ash)|l(?:i(?:ck|de|p)|alom|ogan)|e(?:t(?:-point|ter)?|xy)|w(?:eat(?:shirt|er)|ing)|i(?:decar|evert|ngle|c)|fo(?:rzand|gat)o|k(?:ipper|ate)|quash)|c(?:a(?:n(?:ta(?:bile|ta)|yoning|iche)|r(?:diofitness|paccio|ré)|(?:che-sex|lzon)e|m(?:p(?:us)?|ber)|t(?:ering|gut)|sh-and-carry)|o(?:r(?:don-bleu|pus)|u(?:lomb|ntry)|(?:spla|wbo)y|(?:ck|v)er|loratura|ntinuum|okie)|h(?:a(?:r(?:leston|treuse)|ise-longue)|e(?:ong-sam|ck-in|ddar)|utney)|l(?:ear(?:ance|s)?|u(?:ster|b)|ipboard|oisonné|arke)|r(?:o(?:issan|que)t|evasse)|zar(?:s?ti|ina)|(?:élad|ân)on|u(?:mbia|p)|ódex|ent)|p(?:a(?:nach(?:és?|es?)|t(?:chouli|hos)|intball|parazzi|hoehoe|lmier|anga|rsec|ssim)|i(?:n(?:s(?:cher)?|ce-nez|g-pong|-?up)?|(?:anofort|ckl)e|ercing|dgin|lé)|o(?:st(?:er(?:iori|s)|-it)|(?:odl|ch|is)e|ltergeist|t-?pourri|grom|p)|e(?:(?:tit-suiss|nc)e|restroika|corino)|r(?:i(?:ori|se)|omenade|essing|áxis)|lay(?:b(?:ack|oy)|station|maker)|hot(?:o(?:-finish|maton)|s)?|u(?:tter|zzle|nk|b))|t(?:a(?:(?:sk-forc|wni)e|l(?:k-show|iban)|ke(?:away|s)?|n(?:dem|k)|volatura|ekwondo|i-chi|blet)|u(?:t(?:(?:ilimúnd|t)i|u)|rbo-diesel|pperware|grik)|r(?:i(?:p(?:-hop|lex)|al)|a(?:velling|iler)|emolo)|i(?:me(?:-sharing|share)|e-break|ramisu|cket)|e(?:le(?:marketing|x)|(?:rylen)?e|chno|flon)|o(?:p(?:(?:les)?s)?|ner|fu|ri)|h(?:esaur(?:us|i)|ink)|(?:w(?:ee|is)|-shir)t|sunami|ópos)|b(?:a(?:by-(?:sitter|doll|grow)|(?:varois|din)e|ckup|ht|rn)|r(?:e(?:akdance|nt)|u(?:shing|nch)|ainstorming|ie)|e(?:ta-tester|nedictus|cquerel|agle|bop)|o(?:bsleigh|dyb
oard|nsai|xers|ate|rt)|i(?:t(?:map|s)?|odesign|g-bang|p)|u(?:ngee-jumping)|l(?:ister|ague|ues|og)|yte)|k(?:i(?:l(?:o(?:(?:vol|wat)t|b(?:yte|it))|im|t)|t(?:chenette|s(?:ch)?)?|ckbox(?:ing|er)|n[ag]|butz|p)|a(?:r(?:a(?:oke|té)|bovanet|ting)|lash(?:nikov|s)?|mikaze|sba)|r(?:(?:ípto|emli)n|aft|ill)|wa(?:(?:ch|nz)a|shiorkor)|e(?:f(?:fieh|ir)|tchup)|un(?:g-fu|a)|yat)|r(?:o(?:c(?:k(?:er|s)?|aille)|(?:entge|ll-o)n|ttweiler|quefort|aming|okie|deo)|e(?:a(?:lpolitik|dy-made)|p(?:rise|s)|dneck|ggae|m)|i(?:n(?:forzando|ggit|k)|ckettsia|tardando|ff)|a(?:l(?:lentando|enti)|p(?:per|s)?|fting|ve)|é(?:veillon|gie|tro)|öntgen|ubato)|m(?:a(?:t(?:ch-point|rioska)|r(?:keting|chand)|(?:estos|mb)o|(?:st?e|yo)r|na(?:ger|t)|gnificat|jorette|xwell|quis)|i(?:l(?:curie|ady)|s(?:erere|ter)|ndfulness|crofarad)|o(?:d(?:e(?:rato|m)|us)|hair)|e(?:morandum|dley)|u(?:sic-hall|esli)|vdol)|f(?:o(?:[bg]|x(?:-terrier|trot)?|r(?:tran|int)|ndue|yer|lk)|a(?:i(?:t-divers|r-play)|rad(?:ay|s)?|nfreluche|twa|x)|l(?:a(?:sh(?:es)?|menco|t)|i(?:p-flop|nt)|ute)|u(?:n(?:board|ky?)|gato|ton)|r(?:anchising|eeware)|ermata|itness|öhn)|a(?:n(?:ti(?:-(?:establishment|apartheid|dumping)|trust)|gström)|p(?:felstrudel|paratchik|artheid|lomb)|l(?:legr(?:ett)?o|zheimer|ibi)|c(?:celerandos?|id-jazz)|uto(?:pullman|cross)|git(?:-?prop|ato)|yurveda|mabile|irbus)|d(?:o(?:p(?:ing|pler)|wnhill|car|jo|ng)|r(?:ive(?:[rs]|-in)?|ugstore)|e(?:sign(?:er)?|ficit|bye)|i(?:s(?:cman|eur)|rham)|u(?:mping|plex|ce)|a(?:nzón|tcha))|g(?:i(?:ga(?:byte|watt)|rlsband|lbert|nseng)|o(?:(?:odwil|spe)l|belet|uda)|l(?:a(?:snost|mour)|ide)|u(?:aracha|lag)|ru(?:yèr|ng)e|a[ly]|eyser)|h(?:a(?:b(?:it(?:at|us)|anera)|(?:m-ioc-chon|shta)g|rd-rock)|i(?:p(?:p(?:ie|y)|-hop)|drospeed)|o(?:rseball|lding|mo)|eavy-metal|usky|ype)|v(?:i(?:de(?:ocl(?:ip(?:es)?|ub)|s)?|(?:t(?:rin|a)|ntag|vac)e|brato)|e(?:r(?:nissage|sus)|lcro|gan)|o(?:lt(?:e-face|s)?|yeur)|audeville)|l(?:e(?:[dk]|a(?:sing|d)|itmotiv|gato)|o(?:c(?:kout|us)|oping|ess|gin)|a(?:rghetto|ser|mé|ts)|i(?:ngerie|fti
ng))|o(?:ff(?:s(?:hore|et)|ice-boy|line)|s(?:tpolitik|sobuco)|u(?:tsider|guiya)|verbooking|n-?line|ersted|rigami|pus)|j(?:a(?:m(?:-session|boree)|c(?:kpot|uzzi)|zz)|o(?:int-ventur|ul)e|u(?:kebox|nkie)|et-(?:lag|set)|iu-jitsu)|w(?:a(?:(?:lkie-talki|ffl)e|rrant|sp|d)|e(?:b(?:er|s)?|stern)|i(?:ld-card|ndsurf)|o(?:rkshop|n)|hist)|i(?:n(?:ter(?:f(?:eron|ace)|net)|f(?:otainment|luenza)|-octavo|s)|(?:bid|t)em|mpedimenta|ppon|d)|n(?:e(?:(?:cessair|w-ag)e|(?:tspli)?t)|o(?:menklatura|ir)|apalm|uance|ylon)|e(?:n(?:s(?:alada|emble)|tente)|r(?:satz|g)|mmenthal|cstasy|vasé|dam)|qu(?:a(?:lifying|ntum|rk)|i(?:lohertz|che))|z(?:e(?:itgeist|kel|n)|apping)|y(?:u(?:ppie|an)|ang|eti|in)|u(?:ndergroun|ploa)d)">
    <!ENTITY barbarismos2 "b(?:irdwatching|lockchains?|odyboarders?)|backdoors?|bots?|c(?:hipset|rowdfunding)s?|desktops?|DNA|dominatrix(?:es)?|draft|geocach(?:ing|ers?)|h(?:atchback|ijab|otspot|overboard)s?|icebergs?|jetpacks?|k(?:ernels?|evlar)|m(?:alware|illennial)s?|n(?:etworking|otch|uggets?)|overclock(?:ings?)|p(?:arkour|hishing|odcast|unchline)s?|RNA|s(?:martwatch(?:es)|ext(?:ing|ortion)|tormtroppers?|treaming)|trackpads?|w(?:ebsite|halewatching|oks?)">
    <!ENTITY barbarismos3 "abstracts?|applets?|apps?|backbones?|baconburgers?|banners?|bitcoins?|bits?|blue|bluetooth|bogie|boids?|boost|bottom[-]up|brainstorms?|bullying|burnouts?|cameraman|carjacking|chairman|contactless|crackers?|cracking|crashes|cyberbullying|czar|czares|debugging|demand|developers?|dildos?|downgrades?|downtime|drones?|e[-]bullying|e[-]commerce|e[-]learning|e[-]manuals?|e[-]newsletters?|e[-]readers?|exabits?|exabytes?|fair[-]play|faxes|feeders?|flats?|flyers?|forward|führers?|gaming|gangbang|geeks?|gigabits?|gigabytes?|goals?|gold|hackers?|homebanking|homepages?|idem|jiu[-]jitsu|kilobits?|kilobytes?|know[-]how|KO|layers?|lob|loops?|lossless|marketeers?|maydays?|megabits?|megabytes?|milkshakes?|nerds?|netiquette|newsgroups?|off[-]label|offside|on[-]label|overflows?|p[-]values?|papers?|patch|patches|pens?|petabits?|petabytes?|players?|preview|ransomware|reboot|red|reggaeton|restart|roadmaps?|royalty|royalties|scams?|scammers?|screenshots?|scroll|seals?|shell|shelters?|smartphones?|sockets?|stalking|stealth|strings?|swarms?|tablets?|tagging|takeovers?|terabits?|terabytes?|thresholds?|tips?|toolkits?|tracking|triggers?|trolls?|tweeters?|underscores?|Unicode|updates?|upgrades?|uptime|Usenet|wargames?|warm[-]up|webinar|webinares|webmasters?|webpages?|wireless|woofers?|yes[-]man|yottabits?|yottabytes?|zettabits?|zettabytes?">
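As a rough illustration of how these entities behave, the sketch below compiles a short excerpt of `barbarismos2` in Python and checks single tokens against it. It assumes the pattern has to match the whole token and that case is ignored; both are assumptions made for this example, and the helper name `is_barbarism` is made up.

```python
import re

# Excerpt copied from the barbarismos2 entity above (not the full list).
BARBARISMOS2_EXCERPT = r"backdoors?|bots?|desktops?|icebergs?|jetpacks?|trackpads?"

# Case handling is an assumption for this sketch.
PATTERN = re.compile(BARBARISMOS2_EXCERPT, re.IGNORECASE)

def is_barbarism(token):
    """True when the whole token matches one alternative of the excerpt."""
    return PATTERN.fullmatch(token) is not None

for token in ["desktop", "desktops", "iceberg", "mesa", "robots"]:
    print(token, is_barbarism(token))
```

Because the whole token must match, "robots" is not caught by the `bots?` alternative, while both "desktop" and "desktops" are caught by `desktops?`.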
marcoagpinto commented 2 years ago

The easiest way is to find which words aren't foreign in pt-BR. I will add them as an exception token to the rule in the main grammar.xml and then create a rule in the pt-PT folder's grammar.xml with these words.

Basically, the pt-PT rule contains just the words that were added as exceptions in the main grammar.xml.

In other words: duplicate the original rule and make the copy apply only to pt-PT.
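The plan above can be sketched as plain set logic. This is only a model of the intended behaviour, written in Python rather than LT's XML. The word lists are tiny samples (all of them appear in the entities quoted earlier), "design" is kept in the shared list purely for illustration, and the function and variable names are hypothetical.

```python
# Small samples; the real lists live in grammar.xml and are much longer.
BARBARISMS = {"item", "slide", "suspense", "design"}
# Words treated in this thread as accepted in pt-BR (the "exception" list).
ACCEPTED_IN_PT_BR = {"item", "slide", "suspense"}

def is_flagged(word, variant):
    """Model of the two rules: a shared one with exceptions, plus a pt-PT-only one."""
    word = word.lower()
    shared_rule = word in BARBARISMS and word not in ACCEPTED_IN_PT_BR
    pt_pt_only_rule = variant == "pt-PT" and word in ACCEPTED_IN_PT_BR
    return shared_rule or pt_pt_only_rule

for word in ["item", "design"]:
    for variant in ["pt-BR", "pt-PT"]:
        print(word, variant, is_flagged(word, variant))
```

With this split, adding a word to the exception list stops it from being flagged in pt-BR without changing what pt-PT users see.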

marcoagpinto commented 2 years ago

ahhhhh... I am unable to write properly 🙂

ricardojosehlima commented 2 years ago

@marcoagpinto Great! I will try to work on it asap!

ricardojosehlima commented 2 years ago

@marcoagpinto There are two files attached here:

barbarisms-pt-BR_email.txt is based on barbarisms-pt-BR.txt; the only change is to drop the suggestions at the end of the file that replace email and its variants with translations. The change is marked with a * before the words.

barbarisms-pt_nao_BR.txt is based on barbarisms-pt.txt, the file which applies to all variants of Portuguese. The words to be selected for the exception you mentioned are marked with a * before them.

Now, for the ENTITY barbarismos 1, 2 and 3 in grammar.xml: I recognize many words for which, in the standard Brazilian register, no one cares whether they are written in quotation marks or not: suspense, slide, streaming, blockchain, bitcoin. I wonder whether any of those really needs quotation marks, so my suggestion is to create some sort of exception so that these three rules do not apply to the Brazilian variant.

barbarisms-pt-BR_email.txt barbarisms-pt_nao_br.txt
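For reference, the starred entries in files like these could be collected with a few lines of Python. The sketch assumes one entry per line, a leading `*` as the marker described above, and a `word=suggestion` layout. The sample lines are invented for illustration, since the attached files are not reproduced here.

```python
SAMPLE_LINES = [
    "*item=elemento",      # hypothetical entry, marked for the pt-BR exception
    "layout=disposição",   # unmarked entry, left as is
    "*slide=diapositivo",  # hypothetical entry, marked
]

def starred_words(lines):
    """Return the words on lines that start with '*', without the marker."""
    words = []
    for line in lines:
        line = line.strip()
        if line.startswith("*"):
            entry = line[1:].strip()
            words.append(entry.split("=", 1)[0].strip())
    return words

print(starred_words(SAMPLE_LINES))
```

The resulting list is the kind of input that could then be turned into an exception list for the Brazilian variant.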

marcoagpinto commented 2 years ago

@ricardojosehlima

I will look at it at 5am 🙂

I promise 🙂

marcoagpinto commented 2 years ago

right now I can't focus

marcoagpinto commented 2 years ago

@ricardojosehlima

Hello!!!!

I have created a variable for specific pt-BR barbarisms in the main grammar.xml to make it easier to add specific words in the future.

I noticed that the rule already existed (created by me) in pt-PT, so I just added your words to the list.

https://github.com/languagetool-org/languagetool/commit/9f18075d52c1fad03f1646680aded7da46f0e132

https://github.com/languagetool-org/languagetool/commit/ec6d2827d6ed89b2e91606b6f9c08e38ccf4fa56

marcoagpinto commented 2 years ago

@ricardojosehlima

Regarding the .txt files you attached above, could you please commit them or make PRs so that I just need to press one button for them to go live?

This way there won't be any risks of me messing up.

Thanks!

ricardojosehlima commented 2 years ago

Hi @marcoagpinto, I made the PR for barbarisms-pt-BR, which removes the e-mail suggestions. However, the other file, barbarisms-pt, is shared across variants, and if I remove the suggestions from it (there are 163), I think it will affect the variants other than Brazilian, am I right? So, how should I proceed?

marcoagpinto commented 2 years ago

Well, just send the full files to me via e-mail, and I will commit them.

This way success = 100%

marcoagpinto commented 2 years ago

@ricardojosehlima

It is done!

https://github.com/languagetool-org/languagetool/commit/bf614bbec164c1e9ef4329d99e98a9ee88521b78

Thanks!

marcoagpinto commented 2 years ago

I didn't see the pull request; I have been processing video and audio.

ricardojosehlima commented 2 years ago

Hi @marcoagpinto, the word 'item' is still being flagged in pt-BR as foreign. Can you check what is going on?

marcoagpinto commented 2 years ago

@ricardojosehlima

Sure, it should be very easy to fix 😃

I will fix it at 5am…

I don't know if you have been following the conversation regarding the disambiguator, but more verbs which used to be detected as nouns are now correctly detected as verbs.

I need to grab the latest nightly (wikipedia + standalone tool) and unzip them to my desktop so that the rules don't give errors (the grammar.xml file in the repository had changes in the examples, so if I try with my version, TESTRULES PT will throw errors).

Yes, I am a lazy arse… I will wait for 5am when my brain is fresh…

😄

I will then reply here when it is done.

Luckily, the official release is delayed by a few days, so I guess we can still make this fix in the official release.

ricardojosehlima commented 2 years ago

OK! Yes, I have been following the disambiguator work a little; it's necessary and good! I'm still dealing with the beginning of classes at the university where I work, so these last few days (and probably some more to come) I've had less time to dedicate to LT issues.

marcoagpinto commented 2 years ago

I have been writing down tons of improvements to be done for the release in three months.

I will need your help and advice for many of them since I want it to be top-notch.

Also, in the upcoming months I want to fix the verb suggestions (antipatterns) and will need your help, since I am confused in some cases by uses such as "Ela não me deu a chave" and "Ela não deu-me a chave"… I need to be 100% sure of such things and of the proper exceptions before I code antipatterns… I will have to revise all the antipatterns.

Also, the comma rules… my mother went to the trouble of printing documents that explain all the comma rules, but I have been too much of a lazy arse to read them… I must find the inner strength to do it.

marcoagpinto commented 2 years ago

The good news is that in September or October the doctor will officially submit my retirement request, and if all goes well I should be retired some months later (I hope it won't take a year or two). Then I will have more time for LanguageTool and other projects.

marcoagpinto commented 2 years ago

@ricardojosehlima

As promised, here is the fix: https://github.com/languagetool-org/languagetool/commit/889bad0c5abe6631eb2547dd711291083309e362

https://github.com/languagetool-org/languagetool/commit/6a54166be2d364e4754315940d11bf6d451d5d33

marcoagpinto commented 2 years ago

@ricardojosehlima

Hello!

Are the words “input” and “output” foreign in Brazilian Portuguese (pt-BR)?

I forgot to add them as foreign words.

I only noticed today.

ricardojosehlima commented 2 years ago

> @ricardojosehlima
>
> Hello!
>
> Are the words “input” and “output” foreign in Brazilian Portuguese (pt-BR)?
>
> I forgot to add them as foreign words.
>
> I only noticed today.

No, they aren't.

ricardojosehlima commented 2 years ago

> I have been writing down tons of improvements to be done for the release in three months.
>
> I will need your help and advice for many of them since I want it to be top-notch.
>
> Also, in the upcoming months I want to fix the verb suggestions (antipatterns) and will need your help, since I am confused in some cases by uses such as "Ela não me deu a chave" and "Ela não deu-me a chave"… I need to be 100% sure of such things and of the proper exceptions before I code antipatterns… I will have to revise all the antipatterns.
>
> Also, the comma rules… my mother went to the trouble of printing documents that explain all the comma rules, but I have been too much of a lazy arse to read them… I must find the inner strength to do it.

OK, count me in!

marcoagpinto commented 2 years ago

@ricardojosehlima

Look at this: https://forum.languagetool.org/t/reminder-upcoming-feature-freeze-for-languagetool-5-8/8024/4

The official release is delayed until next week.

I will try to fix some more gender and number agreement cases during the weekend.

The more we fix, the better.

😄