Unable to generate islet cell snATAC count matrix

viannegao-zz commented 2 years ago

Hello! I am trying to generate the count matrix for the islet cell snATAC data by running snATAC_pipeline.py with fastq files pulled from GEO.

Unfortunately I am unable to run the whole pipeline and keep getting an empty file for XXX.filt.md.bam. It seemed like an error with picard, so I manually ran the following command and received an error:

java -Xmx24G -jar picard.jar MarkDuplicates INPUT=SRR12957014.compiled.filt.bam OUTPUT=SRR12957014.filt.md.bam VALIDATION_STRINGENCY=LENIENT BARCODE_TAG=BX METRICS_FILE=SRR12957014.MarkDuplicates.log REMOVE_DUPLICATES=false

[Tue Feb 01 09:45:07 EST 2022] Executing as gaov@lx14 on Linux 3.10.0-957.12.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.031-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.26.10
INFO 2022-02-01 09:45:08 MarkDuplicates Start of doWork freeMemory: 2036051352; totalMemory: 2058354688; maxMemory: 22906667008
INFO 2022-02-01 09:45:08 MarkDuplicates Reading input file and constructing read end information.
INFO 2022-02-01 09:45:08 MarkDuplicates Will retain up to 70699589 data points before spilling to disk.
WARNING 2022-02-01 09:45:08 AbstractOpticalDuplicateFinderCommandLineProgram A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR12957014.1.13818841. Cause: String 'SRR12957014.1.13818841_' did not start with a parsable number.
[Tue Feb 01 09:45:08 EST 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: UMI found with illegal characters. UMIs must match the regular expression ^[ATCGNatcgn-]*$.
    at picard.sam.markduplicates.UmiUtil.getTopStrandNormalizedUmi(UmiUtil.java:73)
    at picard.sam.markduplicates.MarkDuplicates.buildReadEnds(MarkDuplicates.java:679)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:551)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:258)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

I then checked the top 5 lines in XXX.compiled.bam file and got the following:

SRR12957015.1.50091124 99 chr1 10064 37 50M = 10307 288 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCC D?DDDIHIIIIIHIHCHHGHIHCHHIIIIHHHECCEHHI1<1<FH0D0<D NM:i:0 MD:Z:50 AS:i:50 XS:i:50 XA:Z:chr4,-191043979,50M,0;chr10,-135524462,50M,0;chr7,+10197,48M2S,0;chr11,+175738,50M,1;chr12,-95475,5S45M,0; MQ:i:37 MC:Z:5S45M BX:Z:SRR12957015.1.50091124
SRR12957015.1.50024188 163 chr1 10100 34 50M = 10270 221 CCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC DDDDDHIHHIIFIIHIIIIIIHIGHHIEHFHH1<FCFH@GF?HH1C<1DH NM:i:0 MD:Z:50 AS:i:50 XS:i:50 MQ:i:34 MC:Z:30M1D20M BX:Z:SRR12957015.1.50024188
SRR12957015.1.17592450_ 147 chr1 10153 40 36M1D14M = 10166 -38 ACCCTAACCCTAACCCTAACCCTAACCTAACCTTAACCTAACCTTAACCC CIHECCEIHIIIHHHHHEDIIIHIIHHHCHIIHHIHIHHHHHHHFDDDDD NM:i:3 MD:Z:32C3^C7C6 AS:i:33 XS:i:33 XA:Z:chrUngl000227,-73922,36M1D14M,3;chr10,-47667359,29M21S,0;chr17,+81195004,23S27M,0;chr20,+62918614,12M1D38M,5; MQ:i:53 MC:Z:50M BX:Z:SRR12957015.1.17592450
SRR12957015.1.17592450 99 chr1 10166 53 50M = 10153 38 CCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC DDDDCHIIIIIGIIIHHIIIIIIHHHIIIIIIIIIIIGHIIII1<FHH<1 NM:i:0 MD:Z:50 AS:i:50 XS:i:45 XA:Z:chr17,-81194997,50M,1;chr20,-62918608,6S44M,0; MQ:i:40 MC:Z:36M1D14M BX:Z:SRR12957015.1.17592450
SRR12957015.1.16969529_ 99 chr1 10251 35 50M = 10296 95 CCCTAAACCCTAACCCTAACCCTAACCCTAAACCCAACCCTAACCCTAAC DDDDDIIIIIIIIIIIIHIIIIIHIIIIIIIIGIHHIHHIHIHHHIHHIH NM:i:4 MD:Z:31C2T5C5C3 AS:i:31 XS:i:40 MQ:i:35 MC:Z:50M BX:Z:SRR12957015.1.16969529

The read names seem to be off. Could you please let me know how to resolve this? If possible, could you please share the processed count matrix of the 3 patients?

Thank you so much in advance!

Vianne

joshchiou commented 2 years ago

Hi Vianne,

It’s possible the error occurs because SRA renames the read names. If you download the file with the original read names, it should work.
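
To make the failure concrete, here is a minimal sketch that checks the BX tag of a SAM record against the pattern quoted in the Picard error message. The helper names are hypothetical and not part of the pipeline. The first sample record is an abbreviated copy of one posted above (sequence and quality shortened), and the second record, with an ACGT-only barcode, is a made-up example of a value that would pass; the thread does not show what the original read names look like.

```python
import re

# Pattern copied from the Picard error message in this thread.
PICARD_UMI_PATTERN = re.compile(r"^[ATCGNatcgn-]*$")


def get_tag(sam_line, tag):
    """Return the value of an optional SAM tag (e.g. 'BX'), or None if absent."""
    # SAM is tab-delimited, but the records pasted above are space-separated,
    # so split on any whitespace (tag values with spaces are not handled).
    fields = sam_line.split()
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        parts = field.split(":", 2)
        if len(parts) == 3 and parts[0] == tag:
            return parts[2]
    return None


def barcode_is_valid(sam_line, tag="BX"):
    """True if the tag is present and matches the pattern Picard enforces."""
    value = get_tag(sam_line, tag)
    return value is not None and PICARD_UMI_PATTERN.match(value) is not None


# Abbreviated record from the thread: BX holds the SRA-renamed read name.
sra_record = (
    "SRR12957015.1.50024188 163 chr1 10100 34 50M = 10270 221 "
    "CCCT DDDD NM:i:0 BX:Z:SRR12957015.1.50024188"
)
# Hypothetical record whose BX tag is a nucleotide barcode (assumption).
barcode_record = (
    "read1 99 chr1 10064 37 50M = 10307 288 "
    "CCCT DDDD NM:i:0 BX:Z:ACGTACGTACGT"
)

print(barcode_is_valid(sra_record))      # False
print(barcode_is_valid(barcode_record))  # True
```

Running a check like this over the first few records of the BAM (for example on the output of samtools view) should show quickly whether the BX values are barcodes or renamed read identifiers.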

Alternatively, here is a link to the processed final data object in .h5ad format.

Direct download: https://cmdga.org/files/DFF877UMT/@@download/DFF877UMT.h5ad.gz
Experiment page: https://cmdga.org/embedding/DSR322WJB/

All the best, Josh

viannegao-zz commented 2 years ago

Hi Josh,

Thank you so much for your quick reply!

I am actually hoping to get the accessibility of 50bp tiles across the genome, so perhaps getting the file with the original read names is the way to go.

I tried to access the original fastq files through AWS, but am getting a permission error as follows:

aws s3 ls s3://sra-pub-src-16/SRR12957014/Islet2_CB.R1.fastq.gz.1*
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

Could you please help me to troubleshoot this? Thank you so much!

Best, Vianne

joshchiou commented 2 years ago

You may need to use the cloud delivery service to access the original files. You can select the SRR IDs that you want to download in your browser, and the service will deposit the files into the bucket of your choice.

https://www.ncbi.nlm.nih.gov/Traces/cloud-delivery/

susannapa commented 2 years ago

Hello,

I downloaded the processed final data object from here (https://cmdga.org/files/DFF877UMT/@@download/DFF877UMT.h5ad.gz) but I am having issues opening the file. I tried to decompress it and got this error: [[gzip: DFF877UMT.h5ad.gz: unexpected end of file]]. I also tried to open it with scanpy and got this error [[OSError: Unable to open file (truncated file: eof = 3269147629, sblock->base_addr = 0, stored_eof = 5952209934)]].
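
Both errors point to an incomplete file: the scanpy message reports 3269147629 bytes on disk where the HDF5 header expects 5952209934, so either the download or the hosted file was cut short. A minimal sketch for checking a .gz archive before trying to open it, using only the Python standard library (the function name is hypothetical):

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses cleanly to its end."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so a multi-gigabyte file is never held in memory.
            while handle.read(chunk_size):
                pass
    except EOFError:
        # The stream stopped before the end-of-stream marker: truncated file.
        return False
    except (gzip.BadGzipFile, zlib.error):
        # Not a gzip file, failed CRC check, or corrupt compressed data.
        return False
    return True
```

A missing file still raises FileNotFoundError, so a False result always means the archive exists but cannot be read to the end. If the check fails after a fresh download, the hosted copy is probably the problem rather than the local one.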

Could you please help me?

Thank you in advance, Susanna

joshchiou commented 2 years ago

Sorry about that. Can you try this file instead (https://cmdga.org/files/DFF446JDW/@@download/DFF446JDW.h5ad.gz, from https://cmdga.org/embedding/DSR047LET/)? It is from a different publication, but is a superset of the same islet samples from the object you tried to download, plus additional islet, pancreas, and PBMC samples.

Best, Josh

susannapa commented 2 years ago

Hi Josh,

Thank you for your help. I downloaded the file you pointed me to and I was able to open it fine. However, there are no peak names but only numbers, and I cannot understand the count matrix format (please see the attached picture).

Could you please help me with this?

Many thanks in advance, Susanna

joshchiou commented 2 years ago

It looks like scanpy is not reading the delimiters correctly. For the first entry, (0,59) is the pair of sparse matrix indices (row, column), and 1.0 is the count stored at that position.
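
As a rough illustration of that layout, the sketch below expands (row, column) value entries into a dense table using plain Python. Only the (0, 59) entry with count 1.0 comes from this thread; the other entries and the matrix size are made up, and reading rows as cells and columns as peaks follows the usual AnnData convention rather than anything confirmed here. In practice scipy.sparse and AnnData do this conversion for you.

```python
def triplets_to_dense(entries, n_rows, n_cols):
    """Expand ((row, col), value) entries into a dense list-of-lists matrix."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries:
        dense[row][col] = value
    return dense


# ((cell index, peak index), count). Only the first entry is from the thread;
# the rest are invented for illustration.
entries = [((0, 59), 1.0), ((0, 3), 2.0), ((1, 59), 1.0)]

dense = triplets_to_dense(entries, n_rows=2, n_cols=60)
print(dense[0][59])  # count for cell 0, peak 59: 1.0
print(dense[1][0])   # positions not listed in the entries are zero: 0.0
```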

Scanpy uses AnnData to read and write .h5ad files, and it looks like the file format changed starting with version 0.8.0 (https://anndata.readthedocs.io/en/latest/). If you downgrade to an earlier version, or find out what format changes were made in the new version, that will probably solve your problem.
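
To see which side of that version boundary an environment is on, a small standard-library sketch like the following can help. The 0.8.0 threshold is taken from the comment above; the function names are hypothetical, and the version parsing is deliberately simple (it ignores suffixes such as "rc1").

```python
from importlib import metadata


def version_tuple(version):
    """Turn '0.7.8' or '0.8.0rc1' into a 3-tuple of ints for comparison."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(version, threshold="0.8.0"):
    """True if `version` is the threshold release or newer."""
    return version_tuple(version) >= version_tuple(threshold)


def installed_anndata_version():
    """Return the installed anndata version string, or None if not installed."""
    try:
        return metadata.version("anndata")
    except metadata.PackageNotFoundError:
        return None


installed = installed_anndata_version()
if installed is None:
    print("anndata is not installed in this environment")
else:
    print(installed, "is 0.8.0 or newer:", is_at_least(installed))
```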

susannapa commented 2 years ago

Hi Josh,

Thank you for your help. I was able to extract the matrix in the right format. However, I would also like to have the peak information. Typically this is stored in adata.var_names in the format chr_start_end. Would you happen to have these annotations for this data?

Thank you, Susanna

joshchiou commented 2 years ago

Sorry about that. I’m not sure why the peak coordinates are missing in the final object. You can find the information in Supplementary Data 3 (https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-021-03552-w/MediaObjects/41586_2021_3552_MOESM6_ESM.xlsx) from the accompanying paper.
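
Once the peak coordinates have been read from the supplementary table, names in the chr_start_end format can be built along the lines of the sketch below. This assumes the table rows are in the same order as the matrix columns, which is not confirmed anywhere in this thread, so the length check is only a first sanity test. The coordinates shown are made up and the function names are hypothetical.

```python
def format_peak_names(peaks):
    """Build 'chr_start_end' names from (chrom, start, end) rows."""
    # int() also accepts floats such as 10064.0, which spreadsheets often produce.
    return [f"{chrom}_{int(start)}_{int(end)}" for chrom, start, end in peaks]


def peak_names_for_matrix(peaks, n_columns):
    """Return peak names, refusing to continue if the counts do not line up."""
    names = format_peak_names(peaks)
    if len(names) != n_columns:
        raise ValueError(
            f"{len(names)} peaks in the table but {n_columns} columns in the matrix"
        )
    return names


# Made-up coordinates standing in for rows of the supplementary table.
example_peaks = [("chr1", 10064, 10564), ("chr1", 20100.0, 20600.0)]
names = peak_names_for_matrix(example_peaks, n_columns=2)
print(names)  # ['chr1_10064_10564', 'chr1_20100_20600']
```

With an AnnData object loaded, the resulting list could then be assigned to adata.var_names, provided the ordering assumption holds.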

Drizzle-Zhang commented 1 year ago

> Sorry about that. Can you try this file instead (https://cmdga.org/files/DFF446JDW/@@download/DFF446JDW.h5ad.gz, from https://cmdga.org/embedding/DSR047LET/)? It is from a different publication, but is a superset of the same islet samples from the object you tried to download, plus additional islet, pancreas, and PBMC samples.

Hello!

I have downloaded the dataset. However, after using the count matrix to perform dimension reduction and clustering, I find that the labels in the .h5ad file don't match the clusters. Could you please help me check whether the rows of the matrix correspond one-to-one, in the same order, to the labels?

Thanks very much!

kjgaulton / pipelines

Unable to generate islet cell snATAC count matrix #12