@mhmd-mst Hi, I'll answer briefly. faiss does not keep a mapping from a faiss index (or any index, e.g. list, dict) to the file name and timestamp within the file. So, you have to build it by yourself.

Thanks for your reply,
```python
fake_recon_index, index_shape = load_memmap_data(
    emb_dummy_dir, 'dummy_db', append_extra_length=query_shape[0], display=False)
fake_recon_index[dummy_db_shape[0]:dummy_db_shape[0] + query_shape[0], :] = db[:, :]
```
Also, wouldn't `append_extra_length=query_shape[0]` cause a problem if the batch size of db and query doesn't match? Thanks again for your help.
fake_recon_index = db
The `query_shape[0]`, `dummy_db_shape[0]`, and `data_shape[0]` all refer to the number of fingerprints you have. In other words, it is the data length, not a batch size.
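To make the shapes concrete, here is a minimal sketch of what the snippet above does, using plain NumPy arrays instead of memmaps and made-up sizes; it assumes, as the assignment `fake_recon_index[...] = db[:, :]` suggests, that the query and db memmaps hold the same number of fingerprints in this evaluation setup:

```python
import numpy as np

d = 128
dummy_db = np.random.rand(1000, d).astype('float32')   # pre-computed dummy fingerprints
db = np.random.rand(9, d).astype('float32')            # your db fingerprints
query = np.random.rand(9, d).astype('float32')         # your query fingerprints (same count as db here)

dummy_db_shape = dummy_db.shape
query_shape = query.shape                              # (number of fingerprints, dim), not a batch size

# append_extra_length=query_shape[0] only reserves extra rows at the end of the
# reconstructed index; they are then filled with the db fingerprints.
fake_recon_index = np.zeros((dummy_db_shape[0] + query_shape[0], d), dtype='float32')
fake_recon_index[:dummy_db_shape[0], :] = dummy_db
fake_recon_index[dummy_db_shape[0]:dummy_db_shape[0] + query_shape[0], :] = db[:, :]
```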
FYI, #17 explains the search process for a sequence query.
For debugging with a small query and DB, my suggestions (a code sketch of all three steps follows the list):
1) As explained in #17 and the mini_search_subroutine, you can directly perform the search and calculate the distance or likelihood between queries and db items.
2) Moving on, try a faiss search using an L2 index. Its output should exactly match the result of 1).
3) Finally, try an IVFPQ index. It requires training the index, which takes several minutes. IVFPQ uses a quantized representation; it is a data-dependent, trainable quantization technique. Usually there is some performance drop when you use faiss this way, compared to 1) and 2); in practice it is a trade-off between speed and accuracy. This implies that if you have a sufficient amount of data (I can't give a number, because it also depends on the parameters of the faiss setup), you won't need to use our dummy data. If your data is small and is music, you can use the dummy data. If your data is small and not music (e.g. speech), you need a replacement for the dummy data from other sources for training the index. After the index is constructed, you can simply remove the dummies, and it is also possible to store the trained index. I cannot give you further detail on this, but the issue board of the faiss project will help.
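Here is a minimal sketch of the three steps, assuming 128-dimensional fingerprints stored in NumPy arrays; the array sizes and faiss parameters are illustrative, not the repo's defaults:

```python
import numpy as np
import faiss

d = 128
rng = np.random.default_rng(0)
db = rng.standard_normal((10000, d)).astype('float32')      # stand-in db fingerprints
query = rng.standard_normal((5, d)).astype('float32')       # stand-in query fingerprints

# 1) Direct (exhaustive) search: squared L2 distance between queries and db items.
dist = ((query[:, None, :] - db[None, :, :]) ** 2).sum(-1)
top1_direct = dist.argmin(axis=1)

# 2) faiss exact search with a flat L2 index; should match 1) exactly.
index_flat = faiss.IndexFlatL2(d)
index_flat.add(db)
_, top1_flat = index_flat.search(query, 1)

# 3) faiss IVFPQ: approximate, data-dependent, and needs training on enough data.
quantizer = faiss.IndexFlatL2(d)
index_ivfpq = faiss.IndexIVFPQ(quantizer, d, 256, 8, 8)      # nlist=256, 8 sub-quantizers, 8 bits
index_ivfpq.train(db)
index_ivfpq.add(db)
index_ivfpq.nprobe = 16
_, top1_ivfpq = index_ivfpq.search(query, 1)

print(top1_direct, top1_flat.ravel(), top1_ivfpq.ravel())
```

Step 2) should reproduce step 1) exactly, while step 3) may differ slightly; that difference is the speed/accuracy trade-off mentioned above.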
The faiss library itself doesn't have such an additional table for storing names in text; faiss is specialized for vector search only. Depending on the scale of your search DB, the table can be implemented as an additional python dict() or a more conventional DB such as MySQL or Elasticsearch.
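As an example, here is a minimal sketch of such a table kept alongside the faiss index; the hop length, file names, and pickle storage are assumptions for illustration, not part of the repo:

```python
import pickle

# lookup[i] -> (file_name, start_time_sec) for the i-th fingerprint added to the index.
fp_hop_sec = 0.5                                   # assumed hop between consecutive fingerprints
lookup = []
for file_name, n_fp in [('song_a.wav', 120), ('song_b.wav', 95)]:   # made-up catalogue
    for k in range(n_fp):
        lookup.append((file_name, k * fp_hop_sec))

# Persist it next to the trained faiss index and load it back at search time.
with open('fp_lookup.pkl', 'wb') as f:
    pickle.dump(lookup, f)

# After `_, ids = index.search(query, k)`, resolve hits through the table:
# file_name, start_sec = lookup[ids[0, 0]]
```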
In the code I noticed that if, for example, the query set contains 3 audio files, the query.mm file will have the fps of all of them. If the query_shape of all fps is (9, 128), i.e. 3 fps per audio file, and I choose test_seq_len as '3', then the np.dot() used in the enumerate(candidates) loop will not be as meaningful as in the case of #17, because now the fps are part of two queries. Am I right in this interpretation?
Also, does this part `pred_ids = candidates[np.argsort(-_scores)[:10]]` give the top 10 most likely ids for the first fp in the query considered?
@mhmd-mst 3 audio files, and 3 fps per file. Your fps.shape = (9, 128). So, fps[0:3] is your first file.

`test_ids = [0, 3, 6]  # start_id of the query sequence from each file`

With these test_ids, the enumerate(candidates) loop is still important, unless the sequence length is 1. The loop averages the likelihood between sequences over highly plausible sequential combination candidates ([1,2,3] vs. [2,3,4] vs. ... but not [3,2,1] nor [5,1,3]). By the way, the default k_probe value of 20 is also too large for your setup. It means that the 1s-segment-level search result for each fp, up to rank 20 per fp (e.g. fps[0], not a sequence-level search), will be considered. I am not sure if this would cause errors though.
> Also this part `pred_ids = candidates[np.argsort(-_scores)[:10]]` gives the top 10 most likely ids for the first fp in the query considered, right?
Yes. If you used test_ids = [0, 3, 6] with seq_len = 3, then in the first iteration of the loop, pred_ids will contain the top-10 3s-sequence-search results for the first query sequence (fps[0,1,2]), starting at test_id = 0.
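Putting it together, here is a rough sketch of that sequence-level search; it follows the spirit of the enumerate(candidates) loop discussed above, but the function name, default k_probe, and simplified shapes are my assumptions rather than the repo's exact code:

```python
import numpy as np

def sequence_search(index, recon_index, q, test_id, seq_len=3, k_probe=3):
    """Return the top-10 candidate start ids for the query sequence q[test_id:test_id+seq_len]."""
    q_seq = q[test_id:test_id + seq_len]            # (seq_len, d) query fingerprints
    _, I = index.search(q_seq, k_probe)             # 1s-segment-level search, rank <= k_probe per fp

    # Shift each hit back by its offset so that every hit votes for a sequence start id.
    candidates = np.unique(I - np.arange(seq_len)[:, None])
    candidates = candidates[candidates >= 0]

    # Sequence-level score: mean inner product along the aligned db sequence.
    _scores = np.array([
        np.mean(np.diag(np.dot(q_seq, recon_index[c:c + seq_len, :].T)))
        for c in candidates
    ])
    pred_ids = candidates[np.argsort(-_scores)[:10]]   # top-10 most likely start ids
    return pred_ids
```

With test_ids = [0, 3, 6] and seq_len = 3, calling this once per test_id gives one top-10 list per query file.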
Hello, regarding the fingerprint of a segment itself, how unique is it? For example, if I have a very huge db, will there be a segment with the same fingerprint as another segment?
It depends on the dimension of the fingerprint and how you define the *same* segment. In the training process, all segments except themselves are expressed differently, and the larger the dimension, the more unique each fingerprint becomes.
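As a toy illustration of the dimension effect (random unit vectors rather than trained fingerprints, so only an assumption-laden sketch): the maximum similarity between distinct items drops as the dimension grows, i.e. exact or near collisions become less likely.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                            # pretend db size
for d in (8, 32, 128):
    v = rng.standard_normal((n, d)).astype('float32')
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # unit-norm "fingerprints"
    sim = v @ v.T
    np.fill_diagonal(sim, -1.0)                     # ignore self-similarity
    print(d, float(sim.max()))                      # max similarity between different items shrinks with d
```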
Hello, I am using a pretrained model that you mentioned in one of the issues, and I want to use it on my own dataset. I modified the dataset class:
and the get_data_source function in generate.py:
I am using a sample of 3 db audio files and 2 query audio files.
Thanks in advance, and thanks for your great work.