serega / gaoya

Locality Sensitive Hashing
MIT License

export func insert_sig, query_one, query_sig, query_sig_return_distance, tokens2signature, iter, size #25

Open Ox0400 opened 1 year ago

serega commented 1 year ago

Thanks for the pull request, @Ox0400. I will take a look over the weekend.

serega commented 1 year ago

Hi. I'm curious: how do you plan to use the new methods? Personally, I only use the Rust part of the project and do not use the Python bindings; I implemented them to learn Python/Rust integration. My intention for the Python version is that you use the Python wrappers simhash.py and minhash.py, where you can provide your own tokenizer. Are you using gaoya.gaoya.simhash.SimHash128StringIntIndex directly?

Ox0400 commented 1 year ago

Hi. I'm curious: how do you plan to use the new methods? Personally, I only use the Rust part of the project and do not use the Python bindings; I implemented them to learn Python/Rust integration. My intention for the Python version is that you use the Python wrappers simhash.py and minhash.py, where you can provide your own tokenizer. Are you using gaoya.gaoya.simhash.SimHash128StringIntIndex directly?

Hi, yes, I was using SimHash64StringIntIndex, like this.

from typing import List, Tuple

from gaoya.simhash import SimHashStringIndex

class SimHashTool(SimHashStringIndex):
    def size(self) -> int:
        return self.index.size()

    def iter(self) -> List[Tuple[int, int]]:
        # [(100, 879782272769711604), (101, 879782272769711604)]
        return self.index.iter()

    def par_bulk_tokens2signatures(self, tokens_list: List[List[str]]) -> List[int]:
        return self.index.par_bulk_tokens2signatures(tokens_list)

    def par_bulk_insert_sig_pairs(self, id_sig_pairs: List[Tuple[int, int]]) -> int:
        self.index.par_bulk_insert_sig_pairs(id_sig_pairs)
        return self.size()

    def query_tokens_return_distance(self, tokens: List[str]) -> List[Tuple[int,int]]:
        return self.index.query_tokens_return_distance(tokens)

    def insert_tokens(self, doc_id: int, tokens: List[str]) -> None:
        self.index.insert_tokens(doc_id, tokens)

    def par_bulk_insert_tokens_pairs(self, id_tokens_pairs: List[Tuple[int, List[str]]]) -> int:
        self.index.par_bulk_insert_tokens_pairs(id_tokens_pairs)
        return self.size()
    # more functions ....
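
For readers unfamiliar with what signature-level methods such as tokens2signature, insert_sig, and query_sig_return_distance do conceptually, here is a minimal pure-Python sketch. It is not gaoya's implementation and does not call gaoya: the class ToySimHashIndex, the BLAKE2b token hash, and the brute-force scan are all assumptions made for illustration, and only the method names are borrowed from this pull request.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def tokens_to_signature(tokens: List[str], bits: int = 64) -> int:
    """Fold a token list into one SimHash signature of `bits` bits."""
    counts = [0] * bits
    for token in tokens:
        # Hash each token to `bits` bits (BLAKE2b is an arbitrary choice here).
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the signature when its votes are positive.
    sig = 0
    for i, c in enumerate(counts):
        if c > 0:
            sig |= 1 << i
    return sig


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


class ToySimHashIndex:
    """Hypothetical stand-in for an index; scans every entry on query."""

    def __init__(self, max_distance: int = 3,
                 tokenizer: Callable[[str], List[str]] = str.split) -> None:
        self.max_distance = max_distance
        self.tokenizer = tokenizer  # user-supplied tokenizer
        self._sigs: Dict[int, int] = {}

    def size(self) -> int:
        return len(self._sigs)

    def iter(self) -> List[Tuple[int, int]]:
        # (doc_id, signature) pairs
        return list(self._sigs.items())

    def insert_sig(self, doc_id: int, sig: int) -> None:
        self._sigs[doc_id] = sig

    def insert_document(self, doc_id: int, text: str) -> None:
        self.insert_sig(doc_id, tokens_to_signature(self.tokenizer(text)))

    def query_sig_return_distance(self, sig: int) -> List[Tuple[int, int]]:
        # Return (doc_id, distance) pairs within max_distance, nearest first.
        hits = [(doc_id, hamming(sig, s)) for doc_id, s in self._sigs.items()]
        hits = [(d, dist) for d, dist in hits if dist <= self.max_distance]
        return sorted(hits, key=lambda pair: (pair[1], pair[0]))
```

Separating signature computation from insertion, as the method names in the class above suggest, would let a caller compute signatures once (possibly in parallel or in another process) and then insert or query by signature without re-tokenizing the text.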