Within-dataset weighing

sandsq / alc-rs

MIT License

Within-dataset weighing #19

Open sandsq opened 5 months ago

sandsq commented 5 months ago

[x] for an ngram holder, store totals so that frequencies can be computed more easily
[x] within a dataset, apply weighting to each different ngram; for simplicity, use equal weights
[ ] precompute frequencies from counts once, rather than recomputing them at every scoring step
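
For reference, the first two items could look roughly like the sketch below. This is not the actual alc-rs code: `NgramHolder`, `add`, `frequency`, and `equal_weights` are hypothetical names, and reading "each different ngram" as "each ngram length in the dataset" is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical ngram holder: raw counts plus a running total,
/// so a frequency is a single division instead of a re-summation.
#[derive(Debug, Default)]
pub struct NgramHolder {
    counts: HashMap<String, u64>,
    total: u64,
}

impl NgramHolder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record `n` more occurrences of `ngram`, keeping the total in sync.
    pub fn add(&mut self, ngram: &str, n: u64) {
        *self.counts.entry(ngram.to_string()).or_insert(0) += n;
        self.total += n;
    }

    pub fn total(&self) -> u64 {
        self.total
    }

    /// Relative frequency of `ngram`; 0.0 if it is unseen or the holder is empty.
    pub fn frequency(&self, ngram: &str) -> f64 {
        if self.total == 0 {
            return 0.0;
        }
        let count = self.counts.get(ngram).copied().unwrap_or(0);
        count as f64 / self.total as f64
    }
}

/// Equal weights for `num_sizes` ngram lengths (assumed meaning of
/// "each different ngram"); the weights sum to 1.
pub fn equal_weights(num_sizes: usize) -> Vec<f64> {
    if num_sizes == 0 {
        return Vec::new();
    }
    vec![1.0 / num_sizes as f64; num_sizes]
}
```

Keeping the total inside the holder means it cannot drift out of sync with the counts, as long as `add` is the only way to mutate them.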

sandsq commented 5 months ago

Currently, we divide the count of a given ngram by the total at every step of the computation, instead of precomputing the frequencies once at the start. Not going to worry about that for now.
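
If this does become worth doing, a minimal sketch of the precomputation is below. The function names and the `effort` closure are hypothetical stand-ins, not the existing scoring code.

```rust
use std::collections::HashMap;

/// Convert raw counts to relative frequencies in a single pass.
/// Returns an empty map when there are no counts at all.
pub fn precompute_frequencies(counts: &HashMap<String, u64>) -> HashMap<String, f64> {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return HashMap::new();
    }
    counts
        .iter()
        .map(|(ngram, &count)| (ngram.clone(), count as f64 / total as f64))
        .collect()
}

/// Frequency-weighted score. `effort` is a placeholder for whatever
/// per-ngram effort function the scorer actually uses.
pub fn score<F: Fn(&str) -> f64>(frequencies: &HashMap<String, f64>, effort: F) -> f64 {
    frequencies
        .iter()
        .map(|(ngram, &freq)| freq * effort(ngram))
        .sum()
}
```

The division then happens once per ngram at load time, and each scoring step only does a multiply and an add.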

sandsq commented 5 months ago

Actually, scaling might not be so straightforward: longer ngrams mean longer sequences, which mean higher effort scores, even if we scale by frequency. Is this desirable?
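
To make the concern concrete, here is a toy sketch. It assumes, purely for illustration, that the effort of a sequence is a flat per-key cost times its length. Since the frequencies within one ngram size sum to 1, the weighted score then comes out at roughly the ngram length regardless of the frequencies. Dividing by length is shown as one possible option, not a decision; `sequence_effort` and `weighted_score` are hypothetical names.

```rust
/// Toy effort model (an assumption): a flat cost per key, times the length.
pub fn sequence_effort(ngram: &str, per_key: f64) -> f64 {
    ngram.chars().count() as f64 * per_key
}

/// Frequency-weighted effort over `(ngram, frequency)` pairs.
/// With `normalize_by_length`, each ngram's effort is divided by its length.
pub fn weighted_score(freqs: &[(&str, f64)], per_key: f64, normalize_by_length: bool) -> f64 {
    freqs
        .iter()
        .map(|&(ngram, freq)| {
            let len = ngram.chars().count();
            let raw = sequence_effort(ngram, per_key);
            let effort = if normalize_by_length && len > 0 {
                raw / len as f64
            } else {
                raw
            };
            freq * effort
        })
        .sum()
}
```

Under this toy model, bigrams score about 2 and trigrams about 3 without normalization, and both score about 1 with it. Whether the longer sequences should count for more is exactly the open question.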