thiswillbeyourgithub / wdoc

Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype, scalable, under development
GNU General Public License v3.0
143 stars, 12 forks

Using recursed_filetype fails to parse recursive paths #12

Closed by rfishermonteith 2 weeks ago

rfishermonteith commented 2 weeks ago

Using recursed_filetype fails to parse recursive paths (full error below).

This seems fixable by adding

"recursed_filetype": str, 
"pattern": str 

to https://github.com/thiswillbeyourgithub/wdoc/blob/main/wdoc/utils/misc.py#L149

However, this does create the following warning (so I may have missed something):

Cannot set key 'pattern' in a DocDict. Allowed keys are 'loading_failure,anki_tag_filter,filetype,audio_unsilence,json_dict_exclude_keys,doccheck_max_token,load_functions,recur_parent_id,source_tag,anki_template,youtube_translation,whisper_prompt,doccheck_min_token,youtube_language,file_hash,whisper_lang,path,doccheck_min_lang_prob,anki_tag_render_filter,online_media_url_regex,json_dict_template,youtube_audio_backend,anki_profile,anki_deck,audio_backend,pdf_parsers,deepgram_kwargs,anki_notetype,online_media_resourcetype_regex'
You can use the env variable WDOC_STRICT_DOCDICT to avoid this issue.

Full command:

python -m wdoc \
  --path="data_for_wdoc" \
  --filetype="recursive_paths" \
  --task=search \
  --query="How can I make wdoc run faster?" \
  --query_retrievers='default_multiquery' \
  --top_k=auto_200_500 \
  --llms_api_bases="{'model':'http://localhost:11434','query_eval_model':'http://localhost:11434'}" \
  --modelname="ollama/gemma2:2b" \
  --query_eval_modelname="ollama/gemma2:2b" \
  --recursed_filetype="txt" \
  --pattern="*.txt"

Full error below


Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "XXX/venv/lib/python3.11/site-packages/wdoc/__main__.py", line 140, in <module>
    cli_launcher()
  File "XXX/venv/lib/python3.11/site-packages/wdoc/__main__.py", line 69, in cli_launcher
    fire.Fire(wdoc)
  File "XXX/venv/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "XXX/venv/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "XXX/venv/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "<@beartype(wdoc.wdoc.wdoc.__init__) at 0x1323ae7a0>", line 14, in __init__
  File "XXX/venv/lib/python3.11/site-packages/wdoc/utils/misc.py", line 703, in new_func
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "XXX/venv/lib/python3.11/site-packages/wdoc/wdoc.py", line 511, in __init__
    self.loaded_docs = batch_load_doc(
                       ^^^^^^^^^^^^^^^
  File "<@beartype(wdoc.utils.batch_file_loader.batch_load_doc) at 0x1019f0540>", line 110, in batch_load_doc
  File "XXX/venv/lib/python3.11/site-packages/wdoc/utils/batch_file_loader.py", line 158, in batch_load_doc
    parse_recursive_paths(
  File "<@beartype(wdoc.utils.batch_file_loader.parse_recursive_paths) at 0x12f5d2e80>", line 155, in parse_recursive_paths
TypeError: parse_recursive_paths() missing 2 required positional arguments: 'pattern' and 'recursed_filetype'

thiswillbeyourgithub commented 2 weeks ago

Thanks a lot for both issues. I think I'll have time for it tomorrow evening. I already have a good idea of what's going on (in large part thanks to your time-efficient issue report!)

thiswillbeyourgithub commented 2 weeks ago

> Using recursed_filetype fails to parse recursive paths (full error below).
>
> This seems fixable by adding
>
> "recursed_filetype": str,
> "pattern": str
>
> to https://github.com/thiswillbeyourgithub/wdoc/blob/main/wdoc/utils/misc.py#L149

That's a reasonable guess, but it's not the right fix in this case. The variable filetype_args_types lists the arguments that are accepted by a loader function, whereas extra_args_types lists the arguments that are accepted by wdoc itself, like path for example. The recursed_filetype and pattern args are actually destined for the parse_recursive_paths function at https://github.com/thiswillbeyourgithub/wdoc/blob/a0acfd7d57344499c4c72870c08e47897cf13461/wdoc/utils/batch_file_loader.py#L535
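
To illustrate the distinction, here is a minimal sketch of sorting incoming arguments by which table declares them. The table contents and the split_kwargs helper are hypothetical stand-ins, not wdoc's actual code; only the two variable names and the path example come from the description above.

```python
# Hypothetical stand-ins for the two tables; the real contents in wdoc differ.
FILETYPE_ARGS_TYPES = {"whisper_lang": str}  # assumed example of a loader argument
EXTRA_ARGS_TYPES = {"path": str}             # arguments handled by wdoc itself


def split_kwargs(kwargs: dict) -> tuple[dict, dict, dict]:
    """Return (loader_args, extra_args, unknown_args) for the given kwargs."""
    loader, extra, unknown = {}, {}, {}
    for key, value in kwargs.items():
        if key in FILETYPE_ARGS_TYPES:
            loader[key] = value
        elif key in EXTRA_ARGS_TYPES:
            extra[key] = value
        else:
            # Declared in neither table: nothing knows where to route it.
            unknown[key] = value
    return loader, extra, unknown


loader, extra, unknown = split_kwargs(
    {"path": "data_for_wdoc", "whisper_lang": "en", "pattern": "*.txt"}
)
```

In this sketch pattern ends up in the unknown bucket because neither table declares it, which is the kind of situation where an argument can be misrouted or dropped.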

Hence the issue seems to arise from the recursive parser not being triggered, which results in those args being sent to a loader function.

I'll have a fix soon, as it's an issue I've encountered in the past.

The reason I created this DocDict thingie is to validate arguments for the loader functions and to help distinguish them from wdoc args and from recursive args. If you happen to know of a more orthodox way to manage all this, I would gladly accept an explanation or even a PR, because I think that's one of the weak aspects of wdoc currently. Similarly, I plan to completely split up the gigantic methods of the initial wdoc class. Notably, this will help with making a cleaner Python API, something like what scikit-learn does, for example, which is quite intuitive.
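
For readers unfamiliar with the idea, a key-validating dict can be sketched roughly as below. This is not wdoc's DocDict: the CheckedDict class, its allowed_keys set and its strict flag are assumptions made for illustration, and dropping the key in non-strict mode is a choice of this sketch only.

```python
import warnings


class CheckedDict(dict):
    """Dict that only accepts keys from a fixed allow-list (hypothetical sketch)."""

    # Assumed subset of keys, for illustration only.
    allowed_keys = frozenset({"path", "filetype", "file_hash"})

    def __init__(self, *args, strict: bool = True, **kwargs):
        super().__init__()
        self.strict = strict
        # Route initial items through __setitem__ so they are validated too.
        for key, value in dict(*args, **kwargs).items():
            self[key] = value

    def __setitem__(self, key, value):
        if key not in self.allowed_keys:
            message = (
                f"Cannot set key {key!r}. "
                f"Allowed keys are {sorted(self.allowed_keys)}"
            )
            if self.strict:
                raise KeyError(message)
            warnings.warn(message)
            return  # non-strict mode in this sketch: warn and drop the key
        super().__setitem__(key, value)
```

Note that dict.update and dict.setdefault do not go through __setitem__ on a dict subclass, so a complete version would need to override those as well (or build on collections.UserDict).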

thiswillbeyourgithub commented 2 weeks ago

Fixed. This was due to extra_args_types missing pattern and recursed_filetype. Thanks a lot. I pushed both fixes in the latest release, 2.4.3.
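
As a rough illustration of how a missing table entry can surface as the TypeError from the traceback, consider the sketch below. It assumes, purely for illustration, that arguments are filtered against a table of accepted names before being forwarded; the forward helper and the simplified parse_recursive_paths signature are hypothetical and do not reflect wdoc's real code.

```python
def parse_recursive_paths(path: str, pattern: str, recursed_filetype: str) -> str:
    """Simplified stand-in: the real function has a different signature."""
    return f"{path}/**/{pattern} as {recursed_filetype}"


def forward(cli_kwargs: dict, accepted: dict) -> str:
    """Keep only the declared arguments, then call the parser (hypothetical)."""
    kept = {k: v for k, v in cli_kwargs.items() if k in accepted}
    return parse_recursive_paths(**kept)


cli_kwargs = {
    "path": "data_for_wdoc",
    "pattern": "*.txt",
    "recursed_filetype": "txt",
}

# Table without the two keys: they are filtered out before the call.
try:
    forward(cli_kwargs, accepted={"path": str})
except TypeError as err:
    error_message = str(err)

# Table with the two keys declared: the call goes through.
result = forward(
    cli_kwargs,
    accepted={"path": str, "pattern": str, "recursed_filetype": str},
)
```

With the shorter table the call fails with a "missing 2 required positional arguments" TypeError, and with the complete table it succeeds.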

Don't hesitate to reach out if you find more bugs, it really helps a lot!