lejon / PartiallyCollapsedLDA

Implementations of various fast parallelized samplers for LDA, including Partially Collapsed LDA, Light LDA, Partially Collapsed Light LDA and a very efficient Polya-Urn LDA

doc-topic distr. #19

Open mhbodell opened 3 years ago

mhbodell commented 3 years ago

The output saved with "save_doc_theta_estimate = true" has the wrong dimensions, and it contains counts rather than proportions.

This is what the README.txt file says:

        Save the a file with document topic theta estimates (will not include zeros)
        Unlike Phi means which are sampled with thinning, theta means is just a simple
        average of the topic counts in the last iteration divided by the number of
        tokens in the document thus there is not theta_burnin or theta_thinning

        save_doc_theta_estimate = true
        doc_topic_theta_filename = doc_topic_theta.csv
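
For reference, the computation the README describes (topic counts from the last iteration divided by the number of tokens in the document) can be sketched as below. This is only an illustration of the description: `ThetaFromCounts` and `thetaFromCounts` are hypothetical names, not part of PartiallyCollapsedLDA, and the counts are made up.

```java
import java.util.Arrays;

class ThetaFromCounts {
    // Hypothetical helper (not the library's code): turns one document's
    // per-topic token counts into proportions that sum to 1.
    static double[] thetaFromCounts(int[] topicCounts) {
        int tokens = Arrays.stream(topicCounts).sum();
        double[] theta = new double[topicCounts.length];
        if (tokens == 0) {
            return theta; // empty document: leave all proportions at zero
        }
        for (int k = 0; k < topicCounts.length; k++) {
            theta[k] = (double) topicCounts[k] / tokens;
        }
        return theta;
    }

    public static void main(String[] args) {
        int[] counts = {3, 0, 1}; // made-up counts for a 4-token document
        System.out.println(Arrays.toString(thetaFromCounts(counts))); // [0.75, 0.0, 0.25]
    }
}
```

With K topics this yields exactly K values per document, so a file with 2K columns or with raw counts does not match the README's description.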

I have a model with 200 topics, but the doc_theta_means file has 400 columns and one row per document. Why is the number of columns twice the number of topics in the model?

Config-file:

        configs = Spalias
        no_runs = 1

        [Spalias]
        title = PCPLDA
        description = 200 topics with alpha 0.2 and extended priorlist
        dataset = data/fb_politics_news.txt
        scheme = spalias_priors
        seed = 1904
        topics = 200
        alpha = 0.2
        beta = 0.01
        iterations = 1500
        rare_threshold = 0
        batches = 4
        topic_batches = 4
        topic_interval = 500
        start_diagnostic = 200
        debug = 0
        log_type_topic_density = true
        log_document_density = true
        log_phi_density = true
        phi_mean_filename = phi-mean.csv
        phi_mean_burnin = 20
        phi_mean_thin = 5
        stoplist = nsc-test/PartiallyCollapsedLDA-8.4.0/stoplist-empty.txt
        save_vocabulary = true
        vocabulary_filename = lda_vocab.txt
        topic_prior_filename = wfw/bash/priors/k200_v7.txt
        keep_connecting_punctuation = true
        log_topic_indicators = true
        save_sampler = false
        save_doc_theta_estimate = true
        doc_topic_theta_filename = doc_topic_theta.csv
        save_phi_mean = true

I am attaching an image of part of the output so you can see what it looks like.

[Attached image: Screen Shot 2021-04-20 at 10 06 37]

rebeckahw commented 1 year ago

The problem seems to stem from WriteASCIIDoubleMatrix. Decimal numbers are written with commas both as decimal separators and column separators. This adds an extra column for each printed value and every other column gets the value 0.
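
A small standalone sketch of the effect described above. It does not use the library's WriteASCIIDoubleMatrix; `CommaCollisionDemo` and `formatRow` are hypothetical names, and the row values are made up. It joins values with a comma while formatting each value in a given locale.

```java
import java.util.Locale;

class CommaCollisionDemo {
    // Joins one row with "," as the column separator, formatting each
    // value with four decimals in the supplied locale.
    static String formatRow(double[] row, Locale locale) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(String.format(locale, "%.4f", row[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] row = {0.25, 0.75}; // made-up proportions
        String us = formatRow(row, Locale.US);
        String se = formatRow(row, Locale.forLanguageTag("sv-SE"));
        // Count how many CSV fields a reader would see in each line.
        System.out.println(us + " -> " + us.split(",").length + " fields");
        System.out.println(se + " -> " + se.split(",").length + " fields");
    }
}
```

On a JVM with Swedish locale data installed, the second line should split into four fields instead of two: the integer part and the fractional digits of each value land in separate columns.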

lejon commented 1 year ago

Yes, I noticed this bug also, and have a fix in 9.2.0, for parts of the problem, but will have to double check if this is also solved with that fix...

lejon commented 1 year ago

9.2.0 should solve this problem

rebeckahw commented 1 year ago

The test for WriteASCIIDoubleMatrix now passes, but the problem unfortunately remains for me. It could maybe? be caused by the method formatDouble in LDAUtils.java:

        String formatString = "%." + noDigits + "f";
        return String.format(formatString, d);

since String.format() depends on defaultLocale (which for me is SE)
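
A minimal sketch of one way to take the default locale out of the picture, assuming the method has roughly the shape `formatDouble(double d, int noDigits)` (the signature is a guess based on the two lines quoted above, and `LocaleSafeFormat` is a hypothetical class name). Passing an explicit `Locale.ROOT` to `String.format` makes the decimal separator a period regardless of the JVM default.

```java
import java.util.Locale;

class LocaleSafeFormat {
    // Assumed signature; only the explicit Locale argument differs from
    // the snippet quoted in this comment.
    static String formatDouble(double d, int noDigits) {
        String formatString = "%." + noDigits + "f";
        return String.format(Locale.ROOT, formatString, d);
    }

    public static void main(String[] args) {
        // Simulate a Swedish default locale, as in the report.
        Locale.setDefault(Locale.forLanguageTag("sv-SE"));
        System.out.println(formatDouble(0.25, 4)); // 0.2500
    }
}
```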

lejon commented 1 year ago

Yes, it is due to locale and it is a bit of a mess now unfortunately, the combination of Locale and possibility of selecting separator makes it complicated... I'll have a look and see if I can re-design to a better solution.
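
One possible direction, sketched here only as an illustration and not as the planned redesign: make both separators explicit parameters, never consult the default locale, and reject combinations where the column separator contains the decimal separator. `MatrixRowWriter` and `formatRow` are hypothetical names.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class MatrixRowWriter {
    // Formats one matrix row with caller-chosen separators. Throws if the
    // two separators would be indistinguishable in the output.
    static String formatRow(double[] row, int noDigits, char decimalSep, String columnSep) {
        if (columnSep.indexOf(decimalSep) >= 0) {
            throw new IllegalArgumentException(
                "column separator must not contain the decimal separator");
        }
        // Start from locale-neutral symbols, then override the decimal mark.
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(decimalSep);
        DecimalFormat fmt = new DecimalFormat("0", symbols);
        fmt.setMinimumFractionDigits(noDigits);
        fmt.setMaximumFractionDigits(noDigits);
        fmt.setGroupingUsed(false);

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(columnSep);
            }
            sb.append(fmt.format(row[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] row = {0.25, 0.75}; // made-up proportions
        System.out.println(formatRow(row, 4, '.', ","));
        System.out.println(formatRow(row, 4, ',', ";"));
    }
}
```

Note that `DecimalFormat` rounds half-even by default, which can differ from `String.format` in the last digit for exact ties.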