vi3k6i5 / GuidedLDA

semi supervised guided topic model with custom guidedLDA
Mozilla Public License 2.0

if sparse and not np.issubdtype(doc_word.dtype, int) issue!!!! #55

Open ThomasADuffy opened 3 years ago

ThomasADuffy commented 3 years ago

Hey all, I ran into an issue but also found a fix! I was passing a sparse matrix into guidedLDA and got a ValueError raised by the "if sparse and not np.issubdtype(doc_word.dtype, int)" check in matrix_to_lists in utils.py:


def matrix_to_lists(doc_word):
    """Convert a (sparse) matrix of counts into arrays of word and doc indices

    Parameters
    ----------
    doc_word : array or sparse matrix (D, V)
        document-term matrix of counts

    Returns
    -------
    (WS, DS) : tuple of two arrays
        WS[k] contains the kth word in the corpus
        DS[k] contains the document index for the kth word

    """
    if np.count_nonzero(doc_word.sum(axis=1)) != doc_word.shape[0]:
        logger.warning("all zero row in document-term matrix found")
    if np.count_nonzero(doc_word.sum(axis=0)) != doc_word.shape[1]:
        logger.warning("all zero column in document-term matrix found")
    sparse = True
    try:
        # if doc_word is a scipy sparse matrix
        doc_word = doc_word.copy().tolil()
    except AttributeError:
        sparse = False
    if sparse and not np.issubdtype(doc_word.dtype, int):
        raise ValueError("expected sparse matrix with integer values, found float values")  # <-- this is the line that raises

    ii, jj = np.nonzero(doc_word)
    if sparse:
        ss = tuple(doc_word[i, j] for i, j in zip(ii, jj))
    else:
        ss = doc_word[ii, jj]

    n_tokens = int(doc_word.sum())
    DS = np.repeat(ii, ss).astype(np.intc)
    WS = np.empty(n_tokens, dtype=np.intc)
    startidx = 0
    for i, cnt in enumerate(ss):
        cnt = int(cnt)
        WS[startidx:startidx + cnt] = jj[i]
        startidx += cnt
    return WS, DS

The reason is that the sparse matrix going in gets converted to a LIL (list-of-lists) matrix with an np.int64 data type, which np.issubdtype does not treat as a subtype of the built-in int. I changed the check to use np.int64 to get around this, so the only change to the function is:


    if sparse and not np.issubdtype(doc_word.dtype, np.int64):
        raise ValueError("expected sparse matrix with integer values, found float values")

Everything now works as usual. Let me know how I can submit a pull request if needed, as I have not done one before. I believe a better workaround would be a catch-all that checks the dtype against a list of integer types, because they should all work with LDA.
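A minimal sketch of what such a catch-all could look like, assuming NumPy's abstract np.integer type is an acceptable stand-in for "a list of int versions". The helper name is_integer_dtype is hypothetical and is not part of GuidedLDA.

```python
import numpy as np


def is_integer_dtype(dtype):
    """Hypothetical helper: True for any NumPy integer dtype.

    np.integer is the abstract parent of all signed and unsigned
    integer scalar types (int8..int64, uint8..uint64), so a single
    check covers every integer width.
    """
    return bool(np.issubdtype(dtype, np.integer))


# Try the check against a few common dtypes.
results = {
    name: is_integer_dtype(np.dtype(name))
    for name in ("int32", "int64", "uint8", "float64")
}
print(results)
```

With a helper like this, the condition in matrix_to_lists could read "if sparse and not is_integer_dtype(doc_word.dtype)", which would still reject float matrices.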

On Windows 10, Python 3.8.5.

ParitoshSingh07 commented 3 years ago

Would love to see this implemented. It sounds like only the faulty ValueError is stopping the use of sparse matrices, while the underlying code can handle them perfectly well.

hhagedorn commented 3 years ago

Thank you for the solution! On my machine (Windows 10 & Python 3.9) np.int64 did not solve it, but substituting it with np.integer did.
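One possible explanation, shown as a small sketch: a check against np.int64 only accepts that exact width, so a matrix stored with another integer dtype would still be rejected. The int32 array below is an assumption chosen for illustration; the thread does not say which dtype the matrix had on that machine.

```python
import numpy as np

# Assumed example data: a tiny count matrix stored as 32-bit integers.
int32_counts = np.array([[1, 0], [0, 2]], dtype=np.int32)

# int32 is not a subtype of int64, so the np.int64 check still fails.
passes_int64_check = np.issubdtype(int32_counts.dtype, np.int64)

# np.integer is the abstract parent of every integer width, so this passes.
passes_integer_check = np.issubdtype(int32_counts.dtype, np.integer)

print(passes_int64_check, passes_integer_check)
```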