Small fix to how lengths are filtered: filtering must always be applied per dataset, even when all datasets are loaded at once and concatenated. As previously coded, the min/max token thresholds caused many rows to be dropped when all datasets were loaded together, e.g., applying the min token count of stories to the much shorter texts in dailydialog. A minimal sketch of the per-dataset approach is shown below.
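Illustrative sketch only (not the actual implementation): it assumes a pandas DataFrame with `dataset` and `n_tokens` columns, and the dataset names and threshold values are placeholders.

```python
import pandas as pd

# Hypothetical per-dataset token limits; names and values are illustrative only.
TOKEN_LIMITS = {
    "stories": (50, 2000),
    "dailydialog": (5, 300),
}

def filter_lengths_per_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Apply min/max token filtering within each dataset rather than globally."""
    kept = []
    for name, group in df.groupby("dataset"):
        min_tok, max_tok = TOKEN_LIMITS[name]
        kept.append(group[group["n_tokens"].between(min_tok, max_tok)])
    return pd.concat(kept, ignore_index=True)
```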
Features are now dropped more conservatively: only NA columns are removed (not all the columns with a high percentage of zeroes), as also mentioned in #63 (see the sketch below). A separate issue will be opened on how to do this properly.
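A minimal sketch of the more conservative dropping, assuming the raw features live in a pandas DataFrame; whether "NA columns" means all-NA or any-NA is an assumption here.

```python
import pandas as pd

def drop_na_features(features: pd.DataFrame) -> pd.DataFrame:
    # Drop only NA columns; columns with many zeroes are kept for now.
    # This sketch drops columns that are entirely NA (assumption).
    return features.dropna(axis=1, how="all")
```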
New preliminary results after dropping fewer raw features
Small fixes