Closed by EvgeniiaVak 3 years ago
refs #4
sums up the `stats_data` distributions - by language and by author

The same authors (if any) in different languages are treated as different. I guess that's OK, since it doesn't make much sense to have distributions over words from multiple languages for the same author.
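Roughly, the summing amounts to something like the following (a minimal sketch only; the file layout and field names are assumptions for illustration, not the actual `stats_data` schema):

```python
# Sketch: sum per-book distributions into per-language totals.
# Assumes each per-book stats file looks roughly like
# {"language": "en", "author": "...", "distributions": {"char_length": {"3": 120, ...}, ...}}
import glob
import json
from collections import Counter, defaultdict

by_language = defaultdict(lambda: defaultdict(Counter))

for path in glob.glob("stats_data/*.json"):
    with open(path, encoding="utf-8") as f:
        book = json.load(f)
    for dist_name, dist in book["distributions"].items():
        by_language[book["language"]][dist_name].update(dist)
```

The by-author version would only differ in the grouping key.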
It's OK as long as there is a remark in the README.md file (and later in the final report) about the same author being treated separately per language.
- the by-author result is not committed; it takes up ~155MB (without words). Do we still need it for the viz? Basically almost all the data are numbers. Maybe redo them into something that stores them as binary to save space (I'm not sure what would work here, maybe SQLite would do the trick)?
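For reference, a minimal sketch of the SQLite idea (the table and column names below are made up for illustration, not an actual schema):

```python
# Sketch: store per-author numeric stats in a single SQLite file
# instead of many large text files.
import sqlite3

conn = sqlite3.connect("stats_by_author.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS author_stats ("
    " author TEXT, language TEXT, stat_name TEXT, bucket TEXT, count INTEGER)"
)

rows = [
    ("Austen, Jane", "en", "char_length", "3", 120),  # example rows only
    ("Austen, Jane", "en", "char_length", "4", 98),
]
conn.executemany("INSERT INTO author_stats VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()
conn.close()
```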
It would be nice to be able to visualize per author. There are two things to take into account here: one is being able to visualize this, the other is resource utilization, not only the total size but the size that has to be downloaded in each case. As an example, see the conllu project.
Nevertheless, do not commit those 155MB to this repository; once everything is ready, let's put the aggregated result data in a tree structure in the gutenberg project repository.
- the words data is not committed either - I didn't wait for it to finish (it was computing the whole night; the English part takes a lot of time)
For the words aggregation I think it is enough to separate by language only and have this as CSV files with the following format:
word, count
sorted from the most used to the least used (see the sketch below).
If there are (and I'm sure there are) noisy things, they will mostly be at the end of that file; we should leave them and make a comment about this in the report.
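A small sketch of that per-language output (the output path and the toy counters are assumptions for illustration):

```python
# Sketch: write one `word,count` CSV per language, sorted from most to least used.
import csv
from collections import Counter

def write_language_csv(word_counts: Counter, out_path: str) -> None:
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        for word, count in word_counts.most_common():
            writer.writerow([word, count])

# Summing per-book counters for one language is just Counter addition.
english = Counter({"the": 10, "whale": 3}) + Counter({"the": 7, "sea": 2})
write_language_csv(english, "words_en.csv")
```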
Do we need to aggregate in some other way than summing up?
If the question is about the word aggregation, summing is enough, although there are a few other things you might need to check (and now you'll learn by experience), such as Unicode normalization (see the note below).
Before doing that you could also try to fix some tokenization issues (words that were not correctly separated or spurious data), but I wouldn't worry so much about it unless you want to practice and learn cleaning some data.
Note: to normalize data with Python you can use:

```python
import unicodedata

normalized = unicodedata.normalize('NFKC', mystringhere)  # mystringhere: your input string
```
@leomrocha yes, I would love to fix the tokenization issues too, do you know of any pointers on where to start with that (after the normalizing I mean)?
This case is tough because I already did some tokenization beforehand, so the harder cases are the ones that are left.
Nevertheless, the first step is always just going through the data (in this case, check the least common words, starting from the bottom up, which should contain many errors) and trying to find common patterns; then you write some small code that takes advantage of those patterns.
For example, what if you find something like `this-word` and `ultimate-championship`? You could decide that these are either correct, or that they should be split on the `-` hyphen.
There might be other characters that you don't want in the words; for example, if you see something like `my word}`, you would want to clean up the `}` character.
Be careful, when cleaning data, not to introduce noise into the correct words; in this case you might want to run your cleaner only on the least frequent data, for example choosing words that have 2 or fewer occurrences, or only 1.
As for how to do this, there are first the `str.split` and `str.replace` methods; for more complex cases I would recommend the `re` regex library: https://docs.python.org/3/library/re.html
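A rough sketch of that kind of cleanup, applied only to low-frequency words (the threshold, the character set, and the hyphen rule below are illustrative assumptions, not fixed choices):

```python
# Sketch: clean only rare words, leaving frequent (likely correct) ones untouched.
import re
from collections import Counter

STRAY_CHARS = re.compile(r"[{}\[\]()<>|_]")  # characters we probably don't want inside words

def clean_rare_words(counts: Counter, max_count: int = 2) -> Counter:
    cleaned = Counter()
    for word, count in counts.items():
        if count > max_count:
            cleaned[word] += count        # frequent words are left as they are
            continue
        word = STRAY_CHARS.sub("", word)  # drop stray characters like '}'
        for part in word.split("-"):      # decide: keep hyphenated forms or split them
            if part:
                cleaned[part] += count
    return cleaned

print(clean_rare_words(Counter({"my word}": 1, "ultimate-championship": 2, "the": 500})))
```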
@leomrocha about normalizing, should we do it before the distributions per book are counted? There is a `char_length` distribution; I think that might be affected.
Yes, this would be the best approach, but it would need a reprocessing of all the data. Nevertheless, if you want to do it, please go ahead.
However, there are some statistics that can't be computed from the words only (the statistics per sentence and per paragraph), and I don't think it is worth it right now to go back and reprocess everything from the raw books.
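As a concrete illustration of why `char_length` can change under NFKC normalization (the ligature here is just a toy example):

```python
import unicodedata

word = "ﬁre"  # starts with the single 'ﬁ' ligature character (U+FB01)
normalized = unicodedata.normalize("NFKC", word)

print(len(word), len(normalized))  # 3 4 -- NFKC expands the ligature to 'fi'
```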
@EvgeniiaVak I'll merge and close this PR, as we are discussing the next steps already