parthsarthi03 / raptor

The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
https://arxiv.org/abs/2401.18059
MIT License

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k. #15

Closed. LLLeoLi closed this issue 3 months ago.

LLLeoLi commented 3 months ago
catle2aurecon commented 3 months ago

Ran into the same problem. It comes from the tree_builder.build_from_text call, and it happens regardless of which LLM is chosen.
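
A minimal way to reproduce this, assuming the quickstart API from the repo README (the default config calls the OpenAI API, so a key is needed): the input just has to be short enough that only a handful of leaf chunks are produced.

```python
import os
from raptor import RetrievalAugmentation

os.environ["OPENAI_API_KEY"] = "sk-..."  # default summarization/QA/embedding models use OpenAI

RA = RetrievalAugmentation()  # stock config from the README quickstart

# A very short document yields only a few leaf chunks, so the clustering
# step fails with the eigh/eigsh error above, no matter which LLM is used.
RA.add_documents("A single short sentence is not enough text to cluster.")
```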

JacksonCakes commented 3 months ago

I had the same problem as well. It seems to be related to https://github.com/MaartenGr/BERTopic/issues/97#issuecomment-1831494493
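
For what it's worth, the error text in the title is SciPy's own message: scipy.sparse.linalg.eigsh refuses to compute k eigenpairs of a sparse N x N matrix when k >= N. In this pipeline the call appears to come from UMAP's spectral initialization during clustering, so it fires when there are fewer chunks than the configured reduction dimension needs. A minimal, RAPTOR-independent sketch of the SciPy-level failure:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

# A sparse 5 x 5 matrix, so N = 5.
A = csr_matrix(np.diag([1.0, 2.0, 3.0, 4.0, 5.0]))

print(eigsh(A, k=3)[0])  # fine: k < N

# Asking for k >= N eigenpairs of a sparse matrix raises the error from
# the issue title:
# TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. ...
eigsh(A, k=5)
```

In practice that suggests two workarounds: feed in more text (so there are more chunks to cluster), or lower the reduction dimension.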

catle2aurecon commented 3 months ago

Here is how I avoid the above error. This is the configuration that works for me:

2024-03-18 21:48:13,833 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 2
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <__main__.ROOTSummarizationModel object at 0x7f3ee5622a10>
            Embedding Models: {'EMB': <__main__.SBertEmbeddingModel object at 0x7f3db828bf50>}
            Cluster Embedding Model: EMB

        Reduction Dimension: 5
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}

2024-03-18 21:48:13,833 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 2
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <__main__.ROOTSummarizationModel object at 0x7f3ee5622a10>
            Embedding Models: {'EMB': <__main__.SBertEmbeddingModel object at 0x7f3db828bf50>}
            Cluster Embedding Model: EMB

        Reduction Dimension: 5
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}

2024-03-18 21:48:13,833 - Successfully initialized RetrievalAugmentation with Config 
        RetrievalAugmentationConfig:

        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 2
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <__main__.ROOTSummarizationModel object at 0x7f3ee5622a10>
            Embedding Models: {'EMB': <__main__.SBertEmbeddingModel object at 0x7f3db828bf50>}
            Cluster Embedding Model: EMB

        Reduction Dimension: 5
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}

        TreeRetrieverConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Context Embedding Model: EMB
            Embedding Model: <__main__.SBertEmbeddingModel object at 0x7f3db828bf50>
            Num Layers: None
            Start Layer: None

            QA Model: <__main__.ROOTQAModel object at 0x7f3dbc9ec6d0>
            Tree Builder Type: cluster
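
For reference, here is a sketch of how a configuration like the one logged above might be put together, following the custom-model pattern from the repo README. The ROOTSummarizationModel/ROOTQAModel bodies and the tb_* keyword names are assumptions inferred from the logged fields, not confirmed API; adapt them to your raptor version.

```python
from sentence_transformers import SentenceTransformer
from raptor import (RetrievalAugmentation, RetrievalAugmentationConfig,
                    BaseEmbeddingModel, BaseSummarizationModel, BaseQAModel)

class SBertEmbeddingModel(BaseEmbeddingModel):
    """SBERT embeddings, as in the README's custom-model example."""
    def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        return self.model.encode(text)

class ROOTSummarizationModel(BaseSummarizationModel):
    """Hypothetical local summarizer; replace the body with a call to your LLM."""
    def summarize(self, context, max_tokens=150):
        return context[:max_tokens]  # placeholder summary

class ROOTQAModel(BaseQAModel):
    """Hypothetical local QA model; replace the body with a call to your LLM."""
    def answer_question(self, context, question):
        return context  # placeholder answer

config = RetrievalAugmentationConfig(
    summarization_model=ROOTSummarizationModel(),
    qa_model=ROOTQAModel(),
    embedding_model=SBertEmbeddingModel(),
    tb_max_tokens=100,  # "Max Tokens: 100" in the log (keyword name assumed)
    tb_num_layers=2,    # "Num Layers: 2" in the log (keyword name assumed)
)
RA = RetrievalAugmentation(config=config)
```

The "Reduction Dimension: 5" line belongs to the cluster tree builder config; keeping it below the number of chunks being clustered is presumably what avoids the k >= N condition.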
parthsarthi03 commented 3 months ago

This should be fixed by #16.