rdenadai / BR-BERTo

Transformer model for Portuguese language (Brazil pt_BR)
https://huggingface.co/rdenadai/BR_BERTo
MIT License

Questions #1

Open sachaarbonel opened 4 years ago

sachaarbonel commented 4 years ago

Hi @rdenadai, thanks for your great work! I was playing around with your model on Hugging Face and I got results starting with `Ġ`, is that normal? Also, I wanted to know if you would be willing to collaborate on a fine-tuned POS model. My understanding is that we need a CoNLL-U dataset such as UD_Portuguese-GSD and to clean it up to fit the format used by @mrm8488 in his notebook. I started working on a tool to clean up such datasets.

rdenadai commented 4 years ago

Hi @sachaarbonel, thanks for taking the time to test the model... I'm thinking about improving it, since I find it sometimes wonky, but for now doing so would cost me around US$ 180.00 on GCP (I now have a much bigger corpus to train on [check out my repo with a Word2Vec model trained on this new corpus], and could perhaps use a bigger vocab and more training epochs).

Anyway, as for your question about the Ġ, I think it is normal given the tokenizer used (here I'm inferring from other models trained with ByteLevelBPETokenizer, like roberta-base), but I need to explore this further to give you a more precise answer.
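
For illustration, a minimal sketch of where the Ġ comes from (assuming the transformers library and the rdenadai/BR_BERTo checkpoint from the Hub; the example sentence is made up):

```python
# Byte-level BPE (as in roberta-base) encodes a leading space as the "Ġ"
# character, so tokens that start a new word show up prefixed with it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rdenadai/BR_BERTo")

tokens = tokenizer.tokenize("eu gosto de aprender")
print(tokens)  # e.g. ['eu', 'Ġgosto', 'Ġde', 'Ġaprender'] -- exact split depends on the vocab

# Decoding strips the markers and restores normal text:
print(tokenizer.convert_tokens_to_string(tokens))  # "eu gosto de aprender"
```

So an output token that starts with Ġ just means it carries a leading space; decoding (instead of joining raw tokens) gives back normal text.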

As for the collaboration, of course I'm in, and I really appreciate it. For the record, my idea for the model's next step would be to train it on a POS-tagged dataset, and then move to my main area of research, which is sentiment analysis.

I'm already searching for more than the UD dataset for Portuguese POS tagging... and found the following two great repos on GitHub.

sachaarbonel commented 4 years ago

Nice! I hadn't found those datasets when I researched the subject. Wow, $180 sounds like a lot! Does that correspond to one week of training? I'll try to finish up the tool to clean up the datasets. I don't have time this weekend, maybe next week. I'll keep you updated.

mrm8488 commented 4 years ago

Hi, Rodolfo. You mention in your model card that you used the HF script for training a model from scratch. How much data did you use? I have done several experiments, and since that script uses LineByLineTextDataset, which loads everything into memory, I could not train on more than about 600 MB of data. Was this your case, or did you modify anything?

Best, Manu

rdenadai commented 4 years ago

> Nice! I hadn't found those datasets when I researched the subject. Wow, $180 sounds like a lot! Does that correspond to one week of training?

It corresponds to roughly 4 days of training on a T4 or P100, depending... but the dataset I'm using now is almost 3x bigger than the one the model was previously trained on.

No worries about the datasets... I'm also going to explore them, as I mentioned above.

rdenadai commented 4 years ago

@mrm8488 I did need to change the LineByLineTextDataset... I built my own class based on it, since I'm now planning to train on a 2.5 GB dataset.

The model that is on the Hugging Face website was trained on a 900 MB corpus... a small corpus.

Please check out the code here => https://github.com/rdenadai/BR-BERTo/blob/master/transformer.py#L16

It lazy-loads each line of the dataset using pandas' read_csv method...
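
For reference, a hedged sketch of that idea (this is not the actual class in transformer.py; the class name, the CSV layout with a header and a single text column, and the tokenizer being a Hugging Face tokenizer are all assumptions):

```python
# Sketch of a torch Dataset that reads only the requested line from the CSV,
# so the whole corpus is never held in memory at once. Simple but it re-scans
# the file on each access; good enough to illustrate lazy loading.
import pandas as pd
import torch
from torch.utils.data import Dataset


class LazyLineDataset(Dataset):  # hypothetical name
    def __init__(self, path, tokenizer, max_length=128):
        self.path = path
        self.tokenizer = tokenizer
        self.max_length = max_length
        # Count data rows once, without loading the text column into memory.
        self.num_rows = sum(1 for _ in open(path, encoding="utf-8")) - 1  # minus header

    def __len__(self):
        return self.num_rows

    def __getitem__(self, idx):
        # skiprows + nrows makes pandas read a single data row lazily.
        row = pd.read_csv(self.path, skiprows=range(1, idx + 1), nrows=1)
        text = str(row.iloc[0, 0])
        enc = self.tokenizer(text, truncation=True, max_length=self.max_length)
        return torch.tensor(enc["input_ids"])
```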

mrm8488 commented 4 years ago

I see, good job. Does your CustomDataset not scale? I need to build one for 200 GB of data. I think I am going to use HF/nlp, which converts even plain text files to Apache Arrow format.
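
As a side note, a minimal sketch of that route (the library shipped as nlp in mid-2020 and was later renamed datasets; corpus.txt is a placeholder path):

```python
from datasets import load_dataset

# Plain text is converted to a memory-mapped Apache Arrow table on disk,
# so a corpus far larger than RAM (e.g. 200 GB) can still be indexed lazily.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]

print(dataset.num_rows)   # number of lines in the file
print(dataset[0])         # {'text': '...'} read from disk on demand
```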

rdenadai commented 4 years ago

I didn't try with that much data... only with 2.5 GB... one thing you could try in my custom class is, instead of loading the file with pandas, change that line to use Dask... this way you can scale up much more, and since Dask uses pyarrow internally, you could also use it to read Parquet files.

The simple approach you could try is to point pandas at your file and see if it loads... then switch to Dask... and then change the class to better fit your needs.
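
A rough sketch of that swap (the file names and the "text" column are assumptions, not the actual corpus layout):

```python
import dask.dataframe as dd

# Same call shape as pandas.read_csv, but the file is split into lazy partitions.
ddf = dd.read_csv("corpus.csv", blocksize="64MB")

# If the corpus is converted to Parquet first, Dask reads it through pyarrow:
# ddf = dd.read_parquet("corpus.parquet", engine="pyarrow")

# Process one partition at a time instead of materializing everything in RAM.
for partition in ddf.to_delayed():
    chunk = partition.compute()        # each chunk is a regular pandas DataFrame
    for text in chunk["text"]:
        ...  # tokenize / feed the training loop here
```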

mrm8488 commented 4 years ago

Thank you so much for your advice

mrm8488 commented 4 years ago

And why seq_length = 128 instead of 512? Did you want to try a small model first?

rdenadai commented 4 years ago

Yeap... I don't have enough power (on my computer I have a GTX 1060) or money (GCP is billed in dollars) to make a bigger model for now.
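
For context, a hedged sketch of what the 128-token choice looks like in a from-scratch config (every value other than the sequence length is an illustrative assumption, not BR_BERTo's actual hyperparameters):

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=50_000,                  # assumed vocab size
    max_position_embeddings=128 + 2,    # 128-token sequences (+2 for RoBERTa's position offset)
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,                # a smaller stack keeps training within a modest GPU budget
)
```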

sachaarbonel commented 4 years ago

I can help clean up datasets if you guys need.

rdenadai commented 4 years ago

@sachaarbonel thanks, all help is appreciated. Since I'm thinking of rerunning BR_BERTo (perhaps with different parameters) and then doing the POS tagger, I have some time to clean up the POS-tagger datasets I mentioned above, and of course to work on my second task for this model, which is sentiment analysis (I need to build that dataset, since I didn't find a good one in Portuguese; I did have one, but revisiting it showed me I should take better care with it).

In case you guys want the dataset I'm using to train BR_BERTo, just say so and I can send a Google Drive link so you can download it.

sachaarbonel commented 4 years ago

I guess creating a sentiment analysis dataset should be feasible by applying a translation model from Hugging Face to an English dataset.
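
A hedged sketch of that idea (the checkpoint name and the ">>pt_br<<" target-language tag are assumptions to verify on the Hub, and the two English examples are made up): machine-translate English sentences to pt_BR and keep the original labels.

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"   # assumed English->Romance checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

english_samples = [("I loved this movie, great acting!", "positive"),
                   ("Terrible service, never again.", "negative")]

for text, label in english_samples:
    # The leading language tag tells the multi-target model to emit Brazilian Portuguese.
    batch = tokenizer([f">>pt_br<< {text}"], return_tensors="pt", padding=True)
    translated = model.generate(**batch)
    pt_text = tokenizer.decode(translated[0], skip_special_tokens=True)
    print(pt_text, label)  # translated sentence keeps the English label
```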

rdenadai commented 4 years ago

Yeap, that's one way... and I'm thinking of doing it; one problem is checking whether each phrase is correctly translated or not.

And most datasets only have 2-3 labels (positive/negative/neutral); since Affective Computing is my area of research interest, I'm looking for a wider range of emotions.