Open dkim94 opened 3 months ago
Hi, @dkim94, thanks for the report.
The Quickwit 0.8 ingest REST API does not validate documents. We have been working on a new ingest REST API called ingest V2, which allows Quickwit to validate documents.
If you want to try it, check the internal docs, and use the Elasticsearch bulk API. If some documents are invalid, Quickwit will return a 200 response with errors per document.
hum, for missing timestamps it won't return an error just yet, see https://github.com/quickwit-oss/quickwit/issues/5164
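For anyone moving to the Elasticsearch-compatible bulk endpoint: since it answers 200 even when some documents are rejected, the client has to inspect the per-document results. A minimal sketch, assuming the response body follows the usual Elasticsearch bulk shape (an items list whose entries carry a status and an optional error). The exact fields Quickwit returns may differ, and failed_bulk_items is a hypothetical helper name:

```python
import json


def failed_bulk_items(response_text):
    """Return (position, status, error) for every bulk item that failed.

    Assumes an Elasticsearch-style bulk response body; this shape is an
    assumption here, not something confirmed in this thread.
    """
    body = json.loads(response_text)
    failures = []
    for position, item in enumerate(body.get("items", [])):
        # Each item is keyed by its action name, e.g. {"index": {...}}.
        for result in item.values():
            if "error" in result or result.get("status", 200) >= 400:
                failures.append((position, result.get("status"), result.get("error")))
    return failures


# Made-up response used only to exercise the helper.
sample = json.dumps({
    "errors": True,
    "items": [
        {"index": {"status": 201}},
        {"index": {"status": 400, "error": {"reason": "missing field"}}},
    ],
})
print(failed_bulk_items(sample))
```

As noted above, a missing timestamp would not show up in such a response yet, so this check does not cover that case.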
Thanks for the reply. I was also looking for the ES-compatible API, so I guess I got lucky. Thanks!
hum, for missing timestamps it won't return an error just yet, see #5164
Yes, the API doesn't return an error. The Quickwit instance also does not return an error, just a warning. However, the logged message shows that an error has occurred. Maybe this also needs to be handled on the instance side in the future?
Describe the bug
The Ingest API returns status_code=200 and num_docs_for_processing, even though no document was actually ingested.
The Quickwit instance log shows that a doc mapper parse error occurred because there was no timestamp in the data. However, from the client's side we get status_code=200 along with num_docs_for_processing, so we are led to believe that the documents were successfully ingested.
Steps to reproduce (if applicable)
Steps to reproduce the behavior:
print(response.status_code)  # 200
print(response.text)  # { "num_docs_for_processing": 22 }
...
2024-08-28T09:39:37.929Z WARN quickwit_indexing::actors::doc_processor: doc mapper parse error: the document must contain field "timestamp field is required" index_id="code-meta" source_id="_ingest-api-source"
...
2024-08-28T09:39:37.936Z INFO index-doc-batches{index_id=code-meta source_id=_ingest-api-source pipeline_uid=01J6BRSCJ843ANR9RKF941KT0N workbench_id=01J6C5NQTAJY6DV91TB5PFHAHK}:publisher{split_update=SplitsUpdate { index_id: "code-meta", new_splits: "", checkpoint_delta: Some(_ingest-api-source:∆(ingest_partition_01J6BRQBP10KVEQMK5RHTY7F2F:(00000000000000000124..00000000000000000147])) }}: quickwit_indexing::actors::publisher: publish-new-splits new_splits=[] checkpoint_delta=Some(_ingest-api-source:∆(ingest_partition_01J6BRQBP10KVEQMK5RHTY7F2F:(00000000000000000124..00000000000000000147]))
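Until the ingest endpoint reports such failures, one workaround is to check documents on the client before sending them. A minimal sketch, assuming the index's timestamp field is named timestamp (inferred from the log above; adjust to the actual doc mapping) and that documents are sent as newline-delimited JSON. split_by_timestamp and to_ndjson are hypothetical helpers, not part of any Quickwit client:

```python
import json


def split_by_timestamp(docs, timestamp_field="timestamp"):
    """Separate docs that carry the timestamp field from those that do not."""
    valid, rejected = [], []
    for doc in docs:
        # Treat a missing key and an explicit null the same way.
        if doc.get(timestamp_field) is None:
            rejected.append(doc)
        else:
            valid.append(doc)
    return valid, rejected


def to_ndjson(docs):
    """Serialize docs as newline-delimited JSON, one document per line."""
    return "\n".join(json.dumps(doc) for doc in docs)


# Illustrative documents only; the field names are assumptions.
docs = [
    {"timestamp": 1724837977, "path": "a.py"},
    {"path": "b.py"},  # no timestamp: the case reported in this issue
]
valid, rejected = split_by_timestamp(docs)
payload = to_ndjson(valid)
print(f"sending {len(valid)} docs, skipping {len(rejected)}")
```

Comparing len(valid) with the num_docs_for_processing value in the response then at least confirms that every document handed to the API was one the client considered well-formed.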
============================ Node Configuration ==============================
#
Website: https://quickwit.io
Docs: https://quickwit.io/docs/configuration/node-config
#
Configure AWS credentials: https://quickwit.io/docs/guides/aws-setup#aws-credentials
#
-------------------------------- General settings --------------------------------
#
Config file format version.
#
version: 0.7
cluster_id: ${QW_CLUSTER_ID}
#
Node ID. Must be unique within a cluster. If not set, a random node ID is generated on each startup.
#
node_id: node-1
#
Quickwit opens three sockets.
- for its HTTP server, hosting the UI and the REST API (TCP)
- for its gRPC service (TCP)
- for its Gossip cluster membership service (UDP)
#
All three services are bound to the same host and a different port. The host can be an IP address or a hostname.
#
Default HTTP server host is 127.0.0.1 and default HTTP port is 7280.
The default host value was chosen to avoid exposing the node to the open-world without users' explicit consent.
This allows for testing Quickwit in single-node mode or with multiple nodes running on the same host and listening
on different ports. However, in cluster mode, using this value is never appropriate because it causes the node to
ignore incoming traffic.
There are two options to set up a node in cluster mode:
1. specify the node's hostname or IP
2. pass 0.0.0.0 and let Quickwit do its best to discover the node's IP (see advertise_address)
#
listen_address: ${QW_LISTEN_ADDRESS}
#
rest:
  listen_port: 7280
  cors_allow_origins:
    - "http://localhost:3000"
  extra_headers:
    x-header-1: header-value-1
    x-header-2: header-value-2
#
grpc:
  max_message_size: 10 MiB
#
IP address advertised by the node, i.e. the IP address that peer nodes should use to connect to the node for RPCs.
The environment variable QW_ADVERTISE_ADDRESS can also be used to override this value.
The default advertise address is listen_address. If listen_address is unspecified (0.0.0.0),
Quickwit attempts to sniff the node's IP by scanning the available network interfaces.
advertise_address: 192.168.0.42
#
In order to join a cluster, one needs to specify a list of
seeds to connect to. If no port is specified, Quickwit will assume
the seeds are using the same port as the current node gossip port.
By default, the peer seed list is empty.
#
peer_seeds:
- quickwit-searcher-0.local
- quickwit-searcher-1.local:10000
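The port-defaulting rule for seeds can be illustrated as follows. This is only a sketch of the behavior described in the comment above, not Quickwit's implementation: with_default_port is a hypothetical helper, 7280 is a placeholder for the node's gossip port, and IPv6 literals are not handled.

```python
def with_default_port(seed, gossip_port):
    """Append gossip_port to a seed that has no explicit port."""
    _host, sep, port = seed.rpartition(":")
    if sep and port.isdigit():
        return seed  # the seed already names a port
    return f"{seed}:{gossip_port}"


seeds = ["quickwit-searcher-0.local", "quickwit-searcher-1.local:10000"]
print([with_default_port(seed, 7280) for seed in seeds])
```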
#
Path to directory where temporary data (caches, intermediate indexing data structures)
is stored. Defaults to ./qwdata.
#
data_dir: /path/to/data/dir
#
Metastore URI. Defaults to data_dir/indexes#polling_interval=30s,
which is a file-backed metastore and mostly convenient for testing. A cluster would
require a metastore backed by Amazon S3 or PostgreSQL.
#
metastore_uri: s3://your-bucket/indexes
metastore_uri: s3://${MY_BUCKET}/qw_indexes
metastore_uri: postgres://username:password@host:port/db
#
When using a file-backed metastore, the state of the metastore will be cached forever.
If you are indexing and searching from different processes, it is possible to periodically
refresh the state of the metastore on the searcher using the polling_interval hashtag.
#
metastore_uri: s3://your-bucket/indexes#polling_interval=30s
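The polling_interval setting rides in the URI fragment, i.e. the part after the #. A small sketch of how a script could read it back out of such a URI using only the Python standard library. How Quickwit itself parses the value is not shown here, and polling_interval() is a hypothetical helper:

```python
from urllib.parse import parse_qs, urlsplit


def polling_interval(metastore_uri):
    """Return the polling_interval value from the URI fragment, or None."""
    fragment = urlsplit(metastore_uri).fragment
    values = parse_qs(fragment).get("polling_interval")
    return values[0] if values else None


print(polling_interval("s3://your-bucket/indexes#polling_interval=30s"))
print(polling_interval("s3://your-bucket/indexes"))
```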
#
Default index root URI, which defines where index data (splits) is stored,
following the scheme {default_index_root_uri}/{index-id}. Defaults to {data_dir}/indexes.
#
default_index_root_uri: s3://your-bucket/indexes
default_index_root_uri: s3://${MY_BUCKET}/qw_indexes
#
-------------------------------- Storage settings --------------------------------
#
Hardcoding credentials into configuration files is not secure and strongly
discouraged. Prefer the alternative authentication methods that your storage
backend may provide.
#
storage:
  azure:
    account: ${QW_AZURE_STORAGE_ACCOUNT}
    access_key: ${QW_AZURE_STORAGE_ACCESS_KEY}
#
  s3:
    access_key_id: ${AWS_ACCESS_KEY_ID}
    secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    endpoint: ${QW_S3_ENDPOINT}
    force_path_style_access: ${QW_S3_FORCE_PATH_STYLE_ACCESS:-false}
    disable_multi_object_delete: false
    disable_multipart_upload: false
#
-------------------------------- Metastore settings --------------------------------
#
metastore:
  postgres:
    min_connections: 0
    max_connections: 10
    acquire_connection_timeout: 10s
    idle_connection_timeout: 10min
    max_connection_lifetime: 30min
#
-------------------------------- Indexer settings --------------------------------
indexer:
  enable_otlp_endpoint: ${QW_ENABLE_OTLP_ENDPOINT:-true}
  split_store_max_num_bytes: 100G
  split_store_max_num_splits: 1000
  max_concurrent_split_uploads: 12
#
-------------------------------- Ingest API settings ------------------------------
#
ingest_api:
  max_queue_memory_usage: 2GiB
  max_queue_disk_usage: 4GiB
#
-------------------------------- Searcher settings --------------------------------
searcher:
  fast_field_cache_capacity: 1G
  split_footer_cache_capacity: 500M
  max_num_concurrent_split_streams: 100
  partial_request_cache_capacity: 64M
  max_num_concurrent_split_searches: 100
#
-------------------------------- Jaeger settings --------------------------------
jaeger:
  enable_endpoint: ${QW_ENABLE_JAEGER_ENDPOINT:-true}