dkim94 commented 3 months ago

**Describe the bug**
The Ingest API returns `status_code=200` and `num_docs_for_processing`, even though no document was actually ingested.

The Quickwit instance log shows that a doc mapper parse error occurred because the data had no timestamp field. However, on the client side we get `status_code=200` along with `num_docs_for_processing`, so we are led to believe that the documents were successfully ingested.

**Steps to reproduce (if applicable)**
Steps to reproduce the behavior:

1. Call the Ingest API. Example (Python):

    import requests

    response = requests.post(
        "http://127.0.0.1:7280/api/v1/test_index/ingest?commit=force",
        headers={"content-type": "application/json"},
        data=data,  # newline-delimited JSON payload
    )

    print(response.status_code)  # 200
    print(response.text)  # { "num_docs_for_processing": 22 }

2. Check the log of the Quickwit instance running on 127.0.0.1:7280

    ...
    2024-08-28T09:39:37.929Z WARN quickwit_indexing::actors::doc_processor: doc mapper parse error: the document must contain field "timestamp field is required" index_id="code-meta" source_id="_ingest-api-source"
    ...
    2024-08-28T09:39:37.936Z INFO index-doc-batches{index_id=code-meta source_id=_ingest-api-source pipeline_uid=01J6BRSCJ843ANR9RKF941KT0N workbench_id=01J6C5NQTAJY6DV91TB5PFHAHK}:publisher{split_update=SplitsUpdate { index_id: "code-meta", new_splits: "", checkpoint_delta: Some(_ingest-api-source:∆(ingest_partition_01J6BRQBP10KVEQMK5RHTY7F2F:(00000000000000000124..00000000000000000147])) }}: quickwit_indexing::actors::publisher: publish-new-splits new_splits=[] checkpoint_delta=Some(_ingest-api-source:∆(ingest_partition_01J6BRQBP10KVEQMK5RHTY7F2F:(00000000000000000124..00000000000000000147]))


**Expected behavior**
The Quickwit instance log shows that no splits were created (`new_splits=[]`), but the client has no way of knowing this.
If no documents were ingested because of an error on the Quickwit instance, the `status_code` should indicate the failure, for example with something like 500.
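
Until the API reports this, one possible client-side workaround is to check each document for the timestamp field before sending it. The sketch below is only an illustration: the helper name `split_by_timestamp` is hypothetical, it assumes the payload is newline-delimited JSON, that the field is called `timestamp`, and that it sits at the top level of each document.

```python
import json


def split_by_timestamp(ndjson_payload, timestamp_field="timestamp"):
    """Split an NDJSON payload into docs that carry the timestamp field and docs that lack it.

    Hypothetical helper: only top-level fields are checked, and the field
    name is an assumption about the index's doc mapping.
    """
    valid, missing = [], []
    for line in ndjson_payload.splitlines():
        if not line.strip():
            continue  # skip blank lines between documents
        doc = json.loads(line)
        if isinstance(doc, dict) and doc.get(timestamp_field) is not None:
            valid.append(doc)
        else:
            missing.append(doc)
    return valid, missing


payload = '{"timestamp": 1724837977, "msg": "ok"}\n{"msg": "no timestamp"}\n'
valid, missing = split_by_timestamp(payload)
print(len(valid), len(missing))
```

Documents that end up in `missing` can be logged or fixed on the client instead of being silently dropped by the doc processor.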

**Configuration:**
Please provide:

1. Output of `quickwit --version`
`Quickwit 0.8.1 (x86_64-unknown-linux-gnu 2024-03-29T14:09:41Z e6c5396)`
3. The index_config.yaml

    # ============================ Node Configuration ==============================
    #
    # Website: https://quickwit.io
    # Docs: https://quickwit.io/docs/configuration/node-config
    #
    # Configure AWS credentials: https://quickwit.io/docs/guides/aws-setup#aws-credentials
    #
    # -------------------------------- General settings --------------------------------
    #
    # Config file format version.
    #
    version: 0.7

    cluster_id: ${QW_CLUSTER_ID}
    #
    # Node ID. Must be unique within a cluster. If not set, a random node ID is generated on each startup.
    #
    # node_id: node-1
    #
    # Quickwit opens three sockets.
    # - for its HTTP server, hosting the UI and the REST API (TCP)
    # - for its gRPC service (TCP)
    # - for its Gossip cluster membership service (UDP)
    #
    # All three services are bound to the same host and a different port. The host can be an IP address or a hostname.
    #
    # Default HTTP server host is `127.0.0.1` and default HTTP port is 7280.
    # The default host value was chosen to avoid exposing the node to the open-world without users' explicit consent.
    # This allows for testing Quickwit in single-node mode or with multiple nodes running on the same host and listening
    # on different ports. However, in cluster mode, using this value is never appropriate because it causes the node to
    # ignore incoming traffic.
    # There are two options to set up a node in cluster mode:
    # 1. specify the node's hostname or IP
    # 2. pass `0.0.0.0` and let Quickwit do its best to discover the node's IP (see `advertise_address`)
    #
    listen_address: ${QW_LISTEN_ADDRESS}
    #
    # rest:
    #   listen_port: 7280
    #   cors_allow_origins:
    #     - "http://localhost:3000"
    #   extra_headers:
    #     x-header-1: header-value-1
    #     x-header-2: header-value-2
    #
    # grpc:
    #   max_message_size: 10 MiB
    #
    # IP address advertised by the node, i.e. the IP address that peer nodes should use to connect to the node for RPCs.
    # The environment variable `QW_ADVERTISE_ADDRESS` can also be used to override this value.
    # The default advertise address is `listen_address`. If `listen_address` is unspecified (`0.0.0.0`),
    # Quickwit attempts to sniff the node's IP by scanning the available network interfaces.
    # advertise_address: 192.168.0.42
    #
    # In order to join a cluster, one needs to specify a list of
    # seeds to connect to. If no port is specified, Quickwit will assume
    # the seeds are using the same port as the current node gossip port.
    # By default, the peer seed list is empty.
    #
    # peer_seeds:
    #   - quickwit-searcher-0.local
    #   - quickwit-searcher-1.local:10000
    #
    # Path to directory where temporary data (caches, intermediate indexing data structures)
    # is stored. Defaults to `./qwdata`.
    #
    # data_dir: /path/to/data/dir
    #
    # Metastore URI. Defaults to `data_dir/indexes#polling_interval=30s`,
    # which is a file-backed metastore and mostly convenient for testing. A cluster would
    # require a metastore backed by Amazon S3 or PostgreSQL.
    #
    # metastore_uri: s3://your-bucket/indexes
    metastore_uri: s3://${MY_BUCKET}/qw_indexes
    # metastore_uri: postgres://username:password@host:port/db
    #
    # When using a file-backed metastore, the state of the metastore will be cached forever.
    # If you are indexing and searching from different processes, it is possible to periodically
    # refresh the state of the metastore on the searcher using the `polling_interval` hashtag.
    #
    # metastore_uri: s3://your-bucket/indexes#polling_interval=30s
    #
    # Default index root URI, which defines where index data (splits) is stored,
    # following the scheme `{default_index_root_uri}/{index-id}`. Defaults to `{data_dir}/indexes`.
    #
    # default_index_root_uri: s3://your-bucket/indexes
    default_index_root_uri: s3://${MY_BUCKET}/qw_indexes
    #
    # -------------------------------- Storage settings --------------------------------
    #
    # Hardcoding credentials into configuration files is not secure and strongly
    # discouraged. Prefer the alternative authentication methods that your storage
    # backend may provide.
    #
    # storage:
    #   azure:
    #     account: ${QW_AZURE_STORAGE_ACCOUNT}
    #     access_key: ${QW_AZURE_STORAGE_ACCESS_KEY}
    #
    #   s3:
    #     access_key_id: ${AWS_ACCESS_KEY_ID}
    #     secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    #     region: ${AWS_REGION}
    #     endpoint: ${QW_S3_ENDPOINT}
    #     force_path_style_access: ${QW_S3_FORCE_PATH_STYLE_ACCESS:-false}
    #     disable_multi_object_delete: false
    #     disable_multipart_upload: false
    #
    # -------------------------------- Metastore settings --------------------------------
    #
    # metastore:
    #   postgres:
    #     min_connections: 0
    #     max_connections: 10
    #     acquire_connection_timeout: 10s
    #     idle_connection_timeout: 10min
    #     max_connection_lifetime: 30min
    #
    # -------------------------------- Indexer settings --------------------------------

    indexer:
      enable_otlp_endpoint: ${QW_ENABLE_OTLP_ENDPOINT:-true}
    #   split_store_max_num_bytes: 100G
    #   split_store_max_num_splits: 1000
    #   max_concurrent_split_uploads: 12
    #
    #
    # -------------------------------- Ingest API settings ------------------------------
    #
    # ingest_api:
    #   max_queue_memory_usage: 2GiB
    #   max_queue_disk_usage: 4GiB
    #
    # -------------------------------- Searcher settings --------------------------------
    # searcher:
    #   fast_field_cache_capacity: 1G
    #   split_footer_cache_capacity: 500M
    #   max_num_concurrent_split_streams: 100
    #   partial_request_cache_capacity: 64M
    #   max_num_concurrent_split_searches: 100
    #
    # -------------------------------- Jaeger settings --------------------------------

    jaeger:
      enable_endpoint: ${QW_ENABLE_JAEGER_ENDPOINT:-true}

fmassot commented 3 months ago

Hi, @dkim94, thanks for the report.

The quickwit 0.8 ingest REST API does not validate documents. We have been working on a new ingest REST API called ingest V2, which allows Quickwit to validate documents.

If you want to try it, check the internal docs, and use the elasticsearch bulk API. If some documents are invalid, Quickwit will return a 200 response with errors per document.
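
Because a 200 response can still carry per-document errors, a client has to inspect the response body rather than rely on the status code. The following is a minimal sketch that assumes the body follows the Elasticsearch bulk response format (an `items` list whose entries are keyed by action and may contain `status` and `error`). The helper name `failed_bulk_items` and the sample body are hypothetical, and the fields Quickwit actually returns may differ.

```python
def failed_bulk_items(bulk_response):
    """Return (position, action, error) for every item that reports a failure.

    Assumes an Elasticsearch-style bulk response body; this shape is an
    assumption, not a confirmed description of Quickwit's output.
    """
    failures = []
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item is keyed by its action, e.g. {"index": {...}}.
        for action, result in item.items():
            if "error" in result or result.get("status", 200) >= 400:
                failures.append((position, action, result.get("error")))
    return failures


# Hypothetical response body, for illustration only.
sample = {
    "errors": True,
    "items": [
        {"index": {"status": 201}},
        {"index": {"status": 400, "error": {"type": "parsing_exception", "reason": "..."}}},
    ],
}
for position, action, error in failed_bulk_items(sample):
    print(position, action, error)
```

An empty result means that no item reported an error, which is a stronger signal than the HTTP status code alone.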

trinity-1686a commented 3 months ago

hum, for missing timestamps it won't return an error just yet, see https://github.com/quickwit-oss/quickwit/issues/5164

dkim94 commented 3 months ago

> Hi, @dkim94, thanks for the report.
>
> The quickwit 0.8 ingest REST API does not validate documents. We have been working on a new ingest REST API called ingest V2, which allows Quickwit to validate documents.
>
> If you want to try it, check the internal docs, and use the elasticsearch bulk API. If some documents are invalid, Quickwit will return a 200 response with errors per document.

Thanks for the reply. I was also looking for the ES-compatible API, so I guess I got lucky. Thanks!

dkim94 commented 3 months ago

> hum, for missing timestamps it won't return an error just yet, see #5164

Yes, the API doesn't return an error. The Quickwit instance also does not report an error, only a warning. However, the logged message shows that an error has occurred. Maybe this also needs to be handled on the instance side in the future?

quickwit-oss / quickwit

Ingest API returns 200 even though doc mapper parse error occurred in Quickwit Instance #5356
