TF IDF with Morel - Githubissues

segeljakt commented 2 years ago

Hi, I tried to implement TF-IDF (Term Frequency - Inverse Document Frequency) in Morel, and I almost managed to get it to work.

The formula that is used to compute tf-idf is defined as follows:

tf-idf(t, d) = tf(t, d) * idf(t)
- t is a term
- d is a document in a document set
- idf(t) = log [n/df(t)] + 1 is the inverse document frequency
- n is the total number of documents in the document set
- df(t) is the document frequency of t
  - i.e., number of documents containing the term t
- tf(t, d) is the term frequency of t in d
  - i.e., number of occurrences of the term t within the document d
- 1 is added so that terms which occur in all documents will not be entirely ignored.

This is the code:

(* Split a string by whitespace *)
fun split s =
    let
        fun split0 [] "" words = words
          | split0 [] word words = word :: words
          | split0 (#" " :: s) "" words = split0 s "" words
          | split0 (#" " :: s) word words = split0 s "" (word :: words)
          | split0 (c :: s) word words = split0 s (word ^ (String.str c)) words
    in
    List.rev (split0 (String.explode s) "" [])
end;

val docs = [
  { name="doc0.txt", text="a a a d b b b c c c" },
  { name="doc1.txt", text="a c c d e f e e i j" },
  { name="doc2.txt", text="e f c d f f i h a j" },
  { name="doc3.txt", text="a i c j e f g h i j" },
  { name="doc4.txt", text="e f j d d f i d d j" },
  { name="doc5.txt", text="b i c i e j g j i g" },
  { name="doc6.txt", text="a a c d b b e e j j" },
  { name="doc7.txt", text="a b j j e f f h i j" }
];

val n = List.length docs;

(* Compute the TF-IDF for all terms in docs *)
from doc in docs, term in split doc.text
yield {doc.name, term}
group name, term compute tf = count
(* Everything above up until this point works great *)
group term compute df = count
(* There are two problems at this point:
    1. `tf` is no longer a field in the record after the second group operation.
    2. We did not filter out duplicate documents when calculating `df`.
*)
yield {name, term, tfidf = tf * (log ((n / df) + 1))};

Is there a way to calculate tf-idf that I might have missed? It would be interesting to see if it was possible. I am unsure, but maybe a unique relational operator is needed to calculate df.

segeljakt commented 2 years ago

The first problem seems to get solved if I calculate df and tf separately and then join them. It is possible to calculate unique entries with group but I haven't figured it out yet completely. The only thing remaining after that is the logarithm function.

val tfs =
    from doc in docs, term in split doc.text
    yield {doc.name, term}
    group name, term compute tf = count;

val dfs =
    from doc in docs, term in split doc.text
    yield {doc.name, term}
    group term compute df = count

val tfidfs =
    from tf in tfs
    join df in dfs on tf.term = df.term
    yield {tf.term, tf.name, tfidf = tf.tf * (n/df.df + 1)};

julianhyde commented 2 years ago

Thanks for giving Morel a try.

This is an interesting problem, because the different measures require you to group at different granularities. That is hard to do in relational algebra because - using my favorite analogy, the pasta machine - you need to run the pasta through the machine more than once. SQL gives us GROUPING SETS, but it's still hard.

So I settled on an approach that uses correlated functions. Your approach produces collections that could be later joined on the common attributes, but it's basically similar. With some query optimization magic, perhaps both solutions could produce the same physical plan.

Here's my solution:

fun docTerms () =
  let
    val rawDocTerms =
      from doc in docs,
          t in split doc.text
        yield {d = doc.name, t}
    val n = only (from d in docs group compute count)
    fun tf (t, d) =
      only (from dt in rawDocTerms
        where dt.t = t
        andalso dt.d = d
        group compute count)
    fun idf t =
      Math.ln (real n /
        real (only (from r in rawDocTerms
            where r.t = t
            group r.d
            group compute count)))
  in
    from {d, t} in rawDocTerms
      group d, t
      yield {d, t, tf = tf (t, d), idf = idf t}
  end;
val docTerms = fn : unit -> {d:string, idf:real, t:string, tf:int} list

from x in (docTerms ()) where x.d = "doc5.txt";
val it =
  [{d="doc5.txt",idf=0.13353144,t="e",tf=1},
   {d="doc5.txt",idf=0.6931472,t="b",tf=1},
   {d="doc5.txt",idf=0.28768212,t="c",tf=1},
   {d="doc5.txt",idf=0.13353144,t="j",tf=2},
   {d="doc5.txt",idf=0.28768212,t="i",tf=3},
   {d="doc5.txt",idf=1.3862944,t="g",tf=2}]
  : {d:string, idf:real, t:string, tf:int} list

You'll need to use my https://github.com/julianhyde/morel/tree/0088-math branch (work in progress for #88). It includes the Math.ln and Real.fromInt (aka real) functions that I just added. It also includes the use function (to be fixed in #86).

There is potentially also a solution that uses clever aggregate functions over sets, but I didn't have time to write it.

julianhyde commented 2 years ago

Oops, I missed the '+ 1'. Here is a revised solution that sorts by idf-tf:

fun docTerms () =
  let
    val rawDocTerms =
      from doc in docs,
          t in split doc.text
        yield {d = doc.name, t}
    val n = only (from d in docs group compute count)
    fun tf (t, d) =
      only (from dt in rawDocTerms
        where dt.t = t
        andalso dt.d = d
        group compute count)
    fun df t =
        only (from r in rawDocTerms
            where r.t = t
            group r.d
            group compute count)
    fun idf t = Math.ln (real n / real (df t)) + 1.0
    fun idf_tf (t, d) = (idf t) * real (tf (t, d))
  in
    from {d, t} in rawDocTerms
      group d, t
      yield {d, t, tf = tf (t, d), idf = idf t, idf_tf = idf_tf (t, d)}
  end;
val docTerms = fn
  : unit -> {d:string, idf:real, idf_tf:real, t:string, tf:int} list

from x in (docTerms ())
  where x.t elem ["b", "g"]
  order x.idf_tf desc;
val it =
  [{d="doc0.txt",idf=1.6931472,idf_tf=5.0794415,t="b",tf=3},
   {d="doc5.txt",idf=2.3862944,idf_tf=4.7725887,t="g",tf=2},
   {d="doc6.txt",idf=1.6931472,idf_tf=3.3862944,t="b",tf=2},
   {d="doc3.txt",idf=2.3862944,idf_tf=2.3862944,t="g",tf=1},
   {d="doc5.txt",idf=1.6931472,idf_tf=1.6931472,t="b",tf=1},
   {d="doc7.txt",idf=1.6931472,idf_tf=1.6931472,t="b",tf=1}]
  : {d:string, idf:real, idf_tf:real, t:string, tf:int} list

segeljakt commented 2 years ago

Thank you, this code is super clean! It is interesting how the relational and functional worlds intersect 🤔

Is the goal for a program like this to execute on a distributed system like Spark or Flink using Calcite, or does Calcite translate the query plan back into Morel? I am also wondering, is there a chance Morel will support data streams?

julianhyde commented 2 years ago

Yes, absolutely. The goal is to provide a language more powerful than SQL that can be executed as efficiently as SQL, on any system that can execute SQL.

In my StrangeLoop talk, I talk about how Morel can implement WordCount. WordCount is the examplar of data-parallel systems like MapReduce and Spark. Those systems execute DAGs of extended relational algebra (extended with aggregate, table functions, and iteration). Via Calcite we can generate such DAGs.

julianhyde commented 2 years ago

By the way, the program I have written is a query. The functions, when inlined, become correlated scalar sub-queries. The query is initially correlated and seems to make multiple passes over the data, but those problems can be solved via rewrites. I expect that the relational algebra plan would include Aggregate with multiple grouping sets, but that can be evaluated efficiently in a data parallel system.

Someone could work backwards from that plan to a SQL query that contains GROUP BY ... GROUPING SETS and calls a "split" user-defined scalar function. That query would run efficiently on most modern DBs. But I claim that writing that query would be difficult and error-prone, and the task is much easier in Morel.

hydromatic / morel

TF IDF with Morel #87