State files, output format

lejon / PartiallyCollapsedLDA

Implementations of various fast parallelized samplers for LDA, including Partially Collapsed LDA, Light LDA, Partially Collapsed Light LDA and a very efficient Polya-Urn LDA


State files, output format #22

Open rebeckahw opened 1 year ago

rebeckahw commented 1 year ago

Is it possible to get PCLDA to write state output formatted like the state files produced by Mallet, i.e. with the columns: doc source pos typeindex type topic?

The same state information can be recreated from the z_.csv files in combination with the corpus and vocabulary files, but it would be a nice-to-have
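A minimal sketch of that reconstruction in Java, for anyone who needs it in the meantime. It assumes the tokenized corpus, the vocabulary, and the topic indicators have already been read into memory. Parsing the z_.csv, corpus, and vocabulary files is left out because it depends on their layout. The names StateSketch and toStateLines are made up for illustration and are not part of PCLDA or Mallet, and the source column is filled with a fixed NA placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PCLDA or Mallet.
class StateSketch {

    // docs[d][pos] is the token at position pos of document d,
    // z[d][pos] is the topic indicator for that token, and
    // vocabulary.get(i) is the type with type index i.
    static List<String> toStateLines(String[][] docs, int[][] z, List<String> vocabulary) {
        Map<String, Integer> typeIndex = new HashMap<>();
        for (int i = 0; i < vocabulary.size(); i++) {
            typeIndex.put(vocabulary.get(i), i);
        }

        List<String> lines = new ArrayList<>();
        lines.add("#doc source pos typeindex type topic");
        for (int d = 0; d < docs.length; d++) {
            for (int pos = 0; pos < docs[d].length; pos++) {
                String type = docs[d][pos];
                Integer index = typeIndex.get(type);
                if (index == null) {
                    throw new IllegalArgumentException("Token not in vocabulary: " + type);
                }
                // "NA" is a placeholder for the source column (assumption).
                lines.add(d + " NA " + pos + " " + index + " " + type + " " + z[d][pos]);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Toy data standing in for the parsed corpus, z file, and vocabulary.
        String[][] docs = { { "topic", "model" }, { "model" } };
        int[][] z = { { 0, 1 }, { 1 } };
        List<String> vocabulary = Arrays.asList("model", "topic");
        toStateLines(docs, z, vocabulary).forEach(System.out::println);
    }
}
```

The lookup fails loudly on a token that is missing from the vocabulary, since a silent mismatch between corpus and vocabulary would produce wrong type indices.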

lejon commented 1 year ago

I'll have a look, but I can't promise a quick turnaround time, I'm afraid... Life keeps getting in the way nowadays! :)

lejon commented 1 year ago

I now have a 9.2.0 release that (hopefully) supports this. I would be glad if you could test it and verify that it works as expected.

rebeckahw commented 1 year ago

Thank you very much! It works as expected, with one small exception. With Mallet, I get a column named source as the second column. The contents of that column are not really important, but its absence shifts the remaining columns, which matters if they are parsed by column order.
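One way a reader of these files could be made robust to the difference is to look columns up by the names in the header line instead of by fixed position. The following is only a sketch under that assumption. StateColumns, fromHeader and field are hypothetical names, not part of PCLDA or Mallet, and it assumes that fields are whitespace-separated and that no field contains whitespace.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PCLDA or Mallet.
class StateColumns {

    // Maps column name to position, read from a header such as
    // "#doc source pos typeindex type topic".
    static Map<String, Integer> fromHeader(String header) {
        String[] names = header.replaceFirst("^#", "").trim().split("\\s+");
        Map<String, Integer> columns = new HashMap<>();
        for (int i = 0; i < names.length; i++) {
            columns.put(names[i], i);
        }
        return columns;
    }

    // Returns the named field of one data row.
    // Assumes no field contains whitespace, which holds when source is "NA".
    static String field(String row, Map<String, Integer> columns, String name) {
        Integer index = columns.get(name);
        if (index == null) {
            throw new IllegalArgumentException("No such column: " + name);
        }
        return row.trim().split("\\s+")[index];
    }

    public static void main(String[] args) {
        Map<String, Integer> withSource = fromHeader("#doc source pos typeindex type topic");
        Map<String, Integer> withoutSource = fromHeader("#doc pos typeindex type topic");
        // The topic is found in both layouts, regardless of the source column.
        System.out.println(field("0 NA 3 17 model 5", withSource, "topic"));
        System.out.println(field("0 3 17 model 5", withoutSource, "topic"));
    }
}
```

Tools that parse strictly by position would still need the source column to be present, so this does not replace having the same layout as Mallet.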

lejon commented 1 year ago

Ok, thanks for that feedback. I must have been looking at some old spec. I'll have a look.

lejon commented 1 year ago

Hi, sorry for the delay on this. I have now checked the MALLET code and as far as I can see, I'm using the same format. Here is the relevant MALLET code:

public void printState (PrintWriter pw)
  {
      Alphabet a = ilist.getDataAlphabet();
      pw.println ("#doc pos typeindex type topic");
      for (int di = 0; di < topics.length; di++) {
          FeatureSequence fs = (FeatureSequence) ilist.get(di).getData();
          for (int si = 0; si < topics[di].length; si++) {
              int type = fs.getIndexAtPosition(si);
              pw.print(di); pw.print(' ');
              pw.print(si); pw.print(' ');
              pw.print(type); pw.print(' ');
              pw.print(a.lookupObject(type)); pw.print(' ');
              pw.print(topics[di][si]); pw.println();
          }
      }
  }

lejon commented 1 year ago

Hmm, so I was tricked by the different implementations of printState in different samplers. ParallelTopicModel implements printState in another way, which does indeed include the source...

Will use the version with source here also.

lejon commented 1 year ago

It seems that in MALLET the source will almost always be NA... What is the expected value of source?

lejon commented 1 year ago

I have a 9.2.1 version that adds source, but it will basically always be NA. If this field is used I can add the proper info there, but I'm a bit unclear on what it is expected to contain.

rebeckahw commented 1 year ago

Just having the "NA" will be very helpful. (It seems like this field could perhaps be used to preserve some extra information about the input. It is commented as /* The input in a reproducable form, e.g. enabling re-print of string w/ POS tags, usually without target information, e.g. an un-annotated RegionList. */)

liamtabib commented 1 year ago

Hi! Måns here. How do we change the config file to get the state file from a PCLDA run?

lejon commented 1 year ago

If you mean the topic indicators (Z)… see PartiallyCollapsedLDA/Configuration-README.md at master · lejon/PartiallyCollapsedLDA (github.com).

Cheers,
-Leif

MansMeg commented 1 year ago

Hi! No, I mean the state files (the ones you discuss above).