seqscope / ficture

https://seqscope.github.io/ficture/

Error in `ficture run_together` with Visium HD data #11

Closed by Nick-Eagles 1 month ago

Nick-Eagles commented 2 months ago

Hello,

First, thanks for developing this powerful software. I'm working with Visium HD data, so I first followed your tutorial for preparing transcripts.tsv.gz. The first suspicious messages appeared when I ran the following commands from the tutorial:

# Then annotate the coordinates and gene ID to the matrix file (assume
# matrix.mtx.gz is sorted by the pixel indices, which seems to be always true)
awk 'BEGIN{FS=OFS="\t"} NR==FNR{ft[NR]=$1 FS $2; next} ($4 in ft) {print $1, $2, $3, ft[$4], $5 }' \
<(zcat $mtx_dir/features.tsv.gz) \
<(\
join -t $'\t' -1 1 -2 2 $temp_dir/barcodes.tsv <(zcat $mtx_dir/matrix.mtx.gz | tail -n +4 | sed 's/ /\t/g') \
) | sort -k3,3n -k2,2n | sed '1 s/^/#barcode_idx\tX\tY\tgene_id\tgene\tCount\n/' | gzip -c > $out_dir/transcripts.tsv.gz

The warnings were:

join: /fastscratch/myscratch/neagles/barcodes.tsv:10: is not sorted: 10 24242.98    9587.82
join: /dev/fd/62:47: is not sorted: 2786    10  1
join: input is not in sorted order

I don't use join much, but I've read that it misbehaves when its inputs aren't sorted on the join field. Despite those warnings, the pipeline ultimately produced a transcripts.tsv.gz file.
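
In case it's relevant, my understanding is that join expects both inputs to be pre-sorted lexically on the join field under the same collation. A pre-sorting step along these lines (just a sketch reusing the tutorial's variable names, with LC_ALL=C pinning the collation) might be what it wants:

# Sort both join inputs lexically on the barcode index before joining
# (column 1 of barcodes.tsv, column 2 of the matrix body)
LC_ALL=C sort -t $'\t' -k1,1 $temp_dir/barcodes.tsv > $temp_dir/barcodes.sorted.tsv
zcat $mtx_dir/matrix.mtx.gz | tail -n +4 | sed 's/ /\t/g' | LC_ALL=C sort -t $'\t' -k2,2 > $temp_dir/matrix.sorted.tsv
LC_ALL=C join -t $'\t' -1 1 -2 2 $temp_dir/barcodes.sorted.tsv $temp_dir/matrix.sorted.tsv | head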

Next, I attempted to run the entire FICTURE pipeline:

ficture run_together \
    --in-tsv $in_dir/transcripts.tsv.gz \
    --in-minmax $in_dir/coordinate_minmax.tsv \
    --out-dir $out_dir \
    --mu-scale $(python -c "print(1/${microns_per_pixel})") \
    --major-axis Y \
    --all

The error appears to occur in the make_dge step:

--------------------------------------------------------------
Creating DGE for 12um...
--------------------------------------------------------------
ficture make_dge --key Count --input /dcs05/lieber/lcolladotor/Visium_HD_DLPFC_pilot_LIBD4100/Visium_HD_DLPFC_pilot/processed-data/02_ficture/inputs/H1-6PGDCDB_D1/transcripts.tsv.gz --output /dcs05/lieber/lcolladotor/Visium_HD_DLPFC_pilot_LIBD4100/Visium_HD_DLPFC_pilot/processed-data/02_ficture/outputs/H1-6PGDCDB_D1/hexagon.d_12.tsv --hex_width 12 --n_move 2 --min_ct_per_unit 50 --mu_scale 3.958433301344706 --precision 2 --major_axis Y
INFO:root:Random seed 1723571016.2663805
WARNING:root:The designated major key is not one of the specified count columns, --count_header is ignored
['#barcode_idx', 'X', 'Y', 'gene_id', 'gene', 'Count']
Traceback (most recent call last):
  File "/jhpce/shared/libd/core/ficture/0.0.3.1/ficture_env/bin/ficture", line 8, in <module>
    sys.exit(main())
  File "/jhpce/shared/libd/core/ficture/0.0.3.1/ficture_env/lib/python3.9/site-packages/ficture/cli.py", line 43, in main
    function(sys.argv[2:])
  File "/jhpce/shared/libd/core/ficture/0.0.3.1/ficture_env/lib/python3.9/site-packages/ficture/scripts/make_dge_univ.py", line 99, in make_dge
    for chunk in pd.read_csv(args.input, sep='\t', chunksize=1000000, dtype=dty):
  File "/jhpce/shared/libd/core/ficture/0.0.3.1/ficture_env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1843, in __next__
    return self.get_chunk()
  File "/jhpce/shared/libd/core/ficture/0.0.3.1/ficture_env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1985, in get_chunk
    return self.read(nrows=size)
  File "/jhpce/shared/libd/core/ficture/0.0.3.1/ficture_env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/jhpce/shared/libd/core/ficture/0.0.3.1/ficture_env/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 850, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 921, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 1066, in pandas._libs.parsers.TextReader._convert_column_data
  File "parsers.pyx", line 1105, in pandas._libs.parsers.TextReader._convert_tokens
  File "parsers.pyx", line 1226, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Integer column has NA values in column 5

Do you have a sense of what I can check to investigate this error? Do you think the join warnings are problematic and/or related? Please let me know if I can provide additional information. Thanks in advance!
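
If it helps, the kind of check I had in mind for the NA complaint is something like this (just a sketch; Count is the 6th column per the header line in the output above):

# Print the first few rows whose Count column is empty or literally "NA"
zcat $in_dir/transcripts.tsv.gz | awk -F'\t' 'NR > 1 && ($6 == "" || $6 == "NA")' | head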

Best, -Nick

hyunminkang commented 2 months ago

Hi @Nick-Eagles , converting Visium HD data can be a bit tricky because it involves joining multiple files. We've written another tool to make this process more seamless, and I will share the details in the next few days.

I still want to make sure whether our current bash-based instructions have any errors, so I would suggest checking that your transcript file was generated correctly. It should be sorted by the Y axis by default; if it is sorted by the X axis instead, you may need to pass the --major-axis option to ficture run_together. Could you confirm whether your transcripts.tsv.gz is sorted properly?
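
For example, a rough check like the following (just a sketch; it assumes Y is the 3rd column, as in the header you posted) should report the first place where the Y column decreases:

# Report the first data line where Y (3rd column) decreases
zcat transcripts.tsv.gz | awk -F'\t' '
    NR > 1 {
        y = $3 + 0
        if (NR > 2 && y < prev) { print "not sorted by Y at line " NR; bad = 1; exit 1 }
        prev = y
    }
    END { if (!bad) print "looks sorted by Y" }'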

Nick-Eagles commented 2 months ago

Awesome, a tool for preparing Visium HD data would definitely be useful. I just verified that transcripts.tsv.gz is sorted by the Y axis (at least for the first 10,000 rows), though during that check I noticed at least one row with an NA Count, which sounds like the complaint in the original error message. As the data are large and private, I'm not sure of the best way to help you diagnose things further. I'm happy to wait for the new tool unless you have other checks in mind.

hyunminkang commented 2 months ago

Hi @Nick-Eagles -- We have published a separate tool in the spatula package to handle this request. In case the current documentation in FICTURE does not work, please follow the instructions in the spatula documentation at the following URL:

https://seqscope.github.io/spatula/tools/convert_sge/#converting-10x-genomics-visium-hd-feature-barcode-matrix

If you encounter any bugs or have any further questions, please let us know.

Nick-Eagles commented 1 month ago

Thanks, the spatula tool seems to be working so far. I'll close this issue and open a new one if I run into further problems.