Closed by EvgeniiaVak 3 years ago
refs #4
sums up the `stats_data` distributions - by language and by author

The same authors (if any) in different languages are treated as different. I guess that's OK, since it doesn't make much sense to have distributions over words from multiple languages for the same author.
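Roughly, the summing amounts to something like the following (a minimal sketch only; the file layout and field names are assumptions for illustration, not the actual `stats_data` schema):

```python
# Sketch: sum per-book distributions into per-language totals.
# Assumes each per-book stats file looks roughly like
# {"language": "en", "author": "...", "distributions": {"char_length": {"3": 120, ...}, ...}}
import glob
import json
from collections import Counter, defaultdict

by_language = defaultdict(lambda: defaultdict(Counter))

for path in glob.glob("stats_data/*.json"):
    with open(path, encoding="utf-8") as f:
        book = json.load(f)
    for dist_name, dist in book["distributions"].items():
        by_language[book["language"]][dist_name].update(dist)
```

The by-author version would only differ in the grouping key.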
It's OK as long as there is a remark in the README.md file (and later in the final report) about the same author being treated separately per language.
- the by-author result is not committed; it takes up ~155MB (without words). Do we still need it for the viz? Basically almost all the data are numbers. Maybe redo them into something that stores them as binary to save space (I'm not sure what would work here, maybe SQLite would do the trick)?
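For reference, a minimal sketch of the SQLite idea (the table and column names below are made up for illustration, not an actual schema):

```python
# Sketch: store per-author numeric stats in a single SQLite file
# instead of many large text files.
import sqlite3

conn = sqlite3.connect("stats_by_author.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS author_stats ("
    " author TEXT, language TEXT, stat_name TEXT, bucket TEXT, count INTEGER)"
)

rows = [
    ("Austen, Jane", "en", "char_length", "3", 120),  # example rows only
    ("Austen, Jane", "en", "char_length", "4", 98),
]
conn.executemany("INSERT INTO author_stats VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()
conn.close()
```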
It would be nice to be able to visualize per author. There are two things to take into account here: one is being able to visualize this, the other is resource utilization, not only the total size but the size that has to be downloaded in each case. As an example, see the conllu project.
Nevertheless, do not commit those 155MB to this repository; once everything is ready, let's put the aggregated result data in a tree structure in the gutenberg project repository.
- the words data is not committed either - I didn't wait for it to finish (it was computing the whole night; the English part takes a lot of time)
For the words aggregation I think it is enough to separate by language only and have this as CSV files with the following format:
word, count
sorted from the most used to the least used (see the sketch below).
If there are (and I'm sure there are) noisy things, they will mostly be at the end of that file; we should leave them and make a comment about this in the report.
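A small sketch of that per-language output (the output path and the toy counters are assumptions for illustration):

```python
# Sketch: write one `word,count` CSV per language, sorted from most to least used.
import csv
from collections import Counter

def write_language_csv(word_counts: Counter, out_path: str) -> None:
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        for word, count in word_counts.most_common():
            writer.writerow([word, count])

# Summing per-book counters for one language is just Counter addition.
english = Counter({"the": 10, "whale": 3}) + Counter({"the": 7, "sea": 2})
write_language_csv(english, "words_en.csv")
```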
Do we need to aggregate in some other way than summing up?
If the question is about the word aggregation, summing is enough, although there are a few other things you might need to check (and now you'll learn by experience), such as Unicode normalization (see the note below).
Before doing that you could also try to fix some tokenization issues (words that were not correctly separated or spurious data), but I wouldn't worry so much about it unless you want to practice and learn cleaning some data.
Note: to normalize data with Python you can use:

```python
import unicodedata

normalized = unicodedata.normalize('NFKC', mystringhere)  # mystringhere: your input string
```
@leomrocha yes, I would love to fix the tokenization issues too, do you know of any pointers on where to start with that (after the normalizing I mean)?
This case is tough because I already did some tokenization beforehand, so the harder cases are the ones that are left.
Nevertheless, the first step is always just going through the data (in this case, check the least common words, starting from the bottom up, which should contain many errors) and trying to find common patterns; then you write some small code that takes advantage of those patterns.
For example, what if you find something like `this-word` and `ultimate-championship`? You could decide that these are either correct, or that they should be split on the `-` hyphen.
There might be other characters that you don't want in the words; for example, if you see something like `my word}`, you would want to clean up the `}` character.
Be careful, when cleaning data, not to introduce noise into the correct words; in this case you might want to run your cleaner only on the least frequent data, for example choosing words that have 2 or fewer occurrences, or only 1.
As for how to do this, there are first the `str.split` and `str.replace` methods; for more complex cases I would recommend the `re` regex library: https://docs.python.org/3/library/re.html
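A rough sketch of that kind of cleanup, applied only to low-frequency words (the threshold, the character set, and the hyphen rule below are illustrative assumptions, not fixed choices):

```python
# Sketch: clean only rare words, leaving frequent (likely correct) ones untouched.
import re
from collections import Counter

STRAY_CHARS = re.compile(r"[{}\[\]()<>|_]")  # characters we probably don't want inside words

def clean_rare_words(counts: Counter, max_count: int = 2) -> Counter:
    cleaned = Counter()
    for word, count in counts.items():
        if count > max_count:
            cleaned[word] += count        # frequent words are left as they are
            continue
        word = STRAY_CHARS.sub("", word)  # drop stray characters like '}'
        for part in word.split("-"):      # decide: keep hyphenated forms or split them
            if part:
                cleaned[part] += count
    return cleaned

print(clean_rare_words(Counter({"my word}": 1, "ultimate-championship": 2, "the": 500})))
```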
@leomrocha about normalizing, should we do it before the distributions per book are counted? There is a `char_length` distribution; I think that might be affected.
Yes, this would be the best approach, but it would need a reprocessing of all the data. Nevertheless, if you want to do it, please go ahead.
However, there are some statistics that can't be computed from the words only (the statistics per sentence and per paragraph), and I don't think it is worth it right now to go back and reprocess everything from the raw books.
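As a concrete illustration of why `char_length` can change under NFKC normalization (the ligature here is just a toy example):

```python
import unicodedata

word = "ﬁre"  # starts with the single 'ﬁ' ligature character (U+FB01)
normalized = unicodedata.normalize("NFKC", word)

print(len(word), len(normalized))  # 3 4 -- NFKC expands the ligature to 'fi'
```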
@EvgeniiaVak I'll merge and close this PR, as we are discussing the next steps already