katanaml / sparrow

Data processing with ML, LLM and Vision LLM
https://katanaml.io
GNU General Public License v3.0
3.73k stars, 379 forks

Model assuming NoneType instead of string. #48

Closed by tribiona 8 months ago

tribiona commented 8 months ago

I am using the gemma-7b model and get this error for the above PDF:

ws_nm_ncqa_recred_oe_batch_desc
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.6/v/string_type
other_specialities
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.6/v/string_type
upin_number
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.6/v/string_type
taxonomy_code
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.6/v/string_type

It looks like wherever the value in the document is None or "No information provided", the model returns NoneType. I cannot give NoneType as the type in the command, because I won't know in advance which fields will be empty. Is there a way for the model to ignore such fields? Some change in the prompts? Or any other solution to this problem?

abaranovskis-redsamurai commented 8 months ago

Gemma is quite bad at data extraction. Use either Starling or adrienbrault/nous-hermes2pro:Q5_K_M-json, as per the config.

tribiona commented 8 months ago

I need a model for commercial use.

abaranovskis-redsamurai commented 8 months ago

https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B comes with an Apache license, which means it is available for commercial use.

tribiona commented 8 months ago

Is it the same as adrienbrault/nous-hermes2pro:Q5_K_M-json?

abaranovskis-redsamurai commented 8 months ago

Yes, correct.
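
For anyone hitting the same string_type errors regardless of model: one possible workaround is to replace null values with an empty string before the output is validated against a schema that requires strings. The sketch below is only an illustration of that idea and is not taken from the Sparrow codebase. The function name `fill_missing_strings`, the placeholder default, and the sample values are all hypothetical. If you control the Pydantic schema yourself, declaring the affected fields as `Optional[str] = None` is another way to accept None.

```python
import json
from typing import Any


def fill_missing_strings(raw_json: str, fields: list[str], placeholder: str = "") -> dict[str, Any]:
    """Parse model output and substitute a placeholder for null or absent fields.

    Hypothetical helper: `fields` is the list of field names that were requested,
    and `placeholder` is whatever string should stand in for "no information".
    """
    data = json.loads(raw_json)
    cleaned: dict[str, Any] = {}
    for name in fields:
        value = data.get(name)  # None if the key is absent or its JSON value is null
        cleaned[name] = placeholder if value is None else value
    return cleaned


# Made-up sample output where one field is null and one is missing entirely.
raw = '{"upin_number": null, "taxonomy_code": "ABC123"}'
result = fill_missing_strings(raw, ["upin_number", "taxonomy_code", "other_specialities"])
print(result)
```

The cleaned dictionary can then be passed to a schema whose fields are all typed as `str` without triggering the NoneType validation error. Whether an empty string is an acceptable stand-in for "no information provided" depends on the downstream use.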