ixxmu / mp_duty

抓取网络文章到github issues保存
https://archives.duty-machine.now.sh/
124 stars 31 forks source link

国家基因组科学数据中心 GSA数据下载、Linux上游分析实战(二) #5169

Closed ixxmu closed 5 months ago

ixxmu commented 5 months ago

https://mp.weixin.qq.com/s/uVMKL1Us4YWDVen2JB4kGQ

ixxmu commented 5 months ago

国家基因组科学数据中心 GSA数据下载、Linux上游分析实战(二) by 生信小博士

上次从GSA下载的结果如下:


1.初步质控看看结果如何

conda activate rna2mkdir -p fastqc_outputnohup fastqc -t 6 -o fastqc_output CRR*/*.fq.gz &


2.使用multiqc合并质控结果

multiqc ./
multiqc的主要参数--force -f Overwrite any existing reports │--config -c Specific config file to load, after those in MultiQC dir / home dir / working ││ dir. ││ (PATH) │--cl-config Specify MultiQC config YAML on the command line (TEXT) │--filename -n Report filename. Use 'stdout' to print to standard out. (TEXT) │--outdir -o Create report in the specified output directory. (TEXT) │--ignore -x Ignore analysis files (GLOB EXPRESSION) │--ignore-samples Ignore sample names (GLOB EXPRESSION) │--ignore-symlinks Ignore symlinked directories and files │--file-list -l Supply a file containing a list of file paths to be searched, one per row
结果如下:

使用浏览器打开看看
从质控图上看,测序质量挺好,接头含量也没超标


3.使用trimgalory进行质量控制(对于本数据可以忽略这一步)
ls | grep '^CRR' >sample.id #批量获取样本名称find ../CRA005925/ -mindepth 1 -type d|cut -d"/" -f3 >sample.id 
conda activate rna2
#!/bin/bashcat sample.id | while read iddo
mkdir ~/silicosis_wangchen/data/02_trimgalory/${id}echo "${id}"
echo "++++++++===========================+++++++++++"

trim_galore --phred33 -q 20 --length 36 \ --stringency 3 --fastqc --paired --max_n 3 \ -o ~/silicosis_wangchen/data/02_trimgalory/${id} \ ~/silicosis_wangchen/data/CRA005925/${id}/${id}_f1.fq.gz \ ~/silicosis_wangchen/data/CRA005925/${id}/${id}_f2.fq.gzdone
chmod +x batch_trimgalore.shnohup bash batch_trimgalore.sh >trim_galore.log 2>&1 &
trimgalory 详细参数
-q/--quality <INT> Trim low-quality ends from reads in addition to adapter removal. For RRBS samples, quality trimming will be performed first, and adapter trimming is carried in a second round. Other files are quality and adapter trimmed in a single pass. The algorithm is the same as the one used by BWA (Subtract INT from all qualities; compute partial sums from all indices to the end of the sequence; cut sequence at the index at which the sum is minimal). Default Phred score: 20.
--phred33 Instructs Cutadapt to use ASCII+33 quality scores as Phred scores (Sanger/Illumina 1.9+ encoding) for quality trimming. Default: ON.
--phred64 Instructs Cutadapt to use ASCII+64 quality scores as Phred scores (Illumina 1.5 encoding) for quality trimming.
--fastqc Run FastQC in the default mode on the FastQ file once trimming is complete.
--fastqc_args "<ARGS>" Passes extra arguments to FastQC. If more than one argument is to be passed to FastQC they must be in the form "arg1 arg2 etc.". An example would be: --fastqc_args "--nogroup --outdir /home/". Passing extra arguments will automatically invoke FastQC, so --fastqc does not have to be specified separately.
-a/--adapter <STRING> Adapter sequence to be trimmed. If not specified explicitly, Trim Galore will try to auto-detect whether the Illumina universal, Nextera transposase or Illumina small RNA adapter sequence was used. Also see '--illumina', '--nextera' and '--small_rna'. If no adapter can be detected within the first 1 million sequences of the first file specified or if there is a tie between several adapter sequences, Trim Galore defaults to '--illumina' (as long as the Illumina adapter was one of the options, else '--nextera' is the default). A single base may also be given as e.g. -a A{10}, to be expanded to -a AAAAAAAAAA.
At a special request, multiple adapters can also be specified like so: -a " AGCTCCCG -a TTTCATTATAT -a TTTATTCGGATTTAT" -a2 " AGCTAGCG -a TCTCTTATAT -a TTTCGGATTTAT", or so: -a "file:../multiple_adapters.fa" -a2 "file:../different_adapters.fa" Potentially in conjucntion with the parameter "-n 3" to trim all adapters. Please note that this is NOT needed for standard trimming! More Information here: https://github.com/FelixKrueger/TrimGalore/issues/86
--max_length <INT> Discard reads that are longer than <INT> bp after trimming. This is only advised for smallRNA sequencing to remove non-small RNA sequences.

--stringency <INT> Overlap with adapter sequence required to trim a sequence. Defaults to a very stringent setting of 1, i.e. even a single bp of overlapping sequence will be trimmed off from the 3' end of any read.

--length <INT> Discard reads that became shorter than length INT because of either quality or adapter trimming. A value of '0' effectively disables this behaviour. Default: 20 bp.
For paired-end files, both reads of a read-pair need to be longer than <INT> bp to be printed out to validated paired-end files (see option --paired). If only one read became too short there is the possibility of keeping such unpaired single-end reads (see --retain_unpaired). Default pair-cutoff: 20 bp.
--max_n COUNT The total number of Ns a read may contain before it will be removed altogether. In a paired-end setting, either read exceeding this limit will result in the entire pair being removed from the trimmed output files. If COUNT is a number between 0 and 1, it is interpreted as a fraction of the read length.


在Linux中,你可以使用ls命令与通配符来获取当前目录下所有以"CRR"开头的文件名称。具体命令如下:

ls CRR*
此命令会列出当前目录中所有以"CRR"开头的文件和目录。如果你只想列出文件名,可以结合grep和ls命令来过滤:

ls | grep '^CRR'
此命令会列出当前目录中所有以"CRR"开头的文件和目录名称。如果你只想要文件而不包含目录,可以使用find命令:find . -maxdepth 1 -type f -name 'CRR*'
此命令会列出当前目录(不包括子目录)中所有以"CRR"开头的文件。




4.Hista2比对

cat >map.sh
#!/bin/bash
#获取样本名称,其实就是每个fastq文件的前缀ls ../CRA005925/ |grep '^CRR' |while read i
do echo ${i} echo "==========*****************=======" hisat2 -x ~/20240125_rnaseq/data/4086673388_YoungLeeyyL/_/index/grcm38/genome -p 20 --summary-file ./${i}.mapinfo.txt \ -1 ~/silicosis_wangchen/data/CRA005925/${i}/${i}_f1.fq.gz \ -2 ~/silicosis_wangchen/data/CRA005925/${i}/${i}_r2.fq.gz \ |samtools sort -@ 20 -o ${i}.sorted.bamdone
chmod +x map.shnohup bash map.sh >map.log 2>&1 &

重要参数
--phred33 qualities are Phred+33 (default) --phred64          qualities are Phred+64
-p/--threads <int> number of alignment threads to launch (1)
Usage: hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
<ht2-idx> Index filename prefix (minus trailing .X.ht2). <m1> Files with #1 mates, paired with files in <m2>. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). <m2> Files with #2 mates, paired with files in <m1>. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). <r> Files with unpaired reads. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). <sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.


5.featurecounts定量
cat  >count.sh #!/bin/bashfeatureCounts -a ~/20240125_rnaseq/data/4086673388_YoungLeeyyL/_/index/Mus_musculus.GRCm38.102.gtf  \--extraAttributes gene_name,gene_biotype -T 60 \-t exon -p -o ./all_samples_counts.txt  ~/silicosis_wangchen/data/03map-data/*.bam


重要参数:Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... 
## Mandatory arguments:
-a <string> Name of an annotation file. GTF/GFF format by default. See -F option for more format information. Inbuilt annotations (SAF format) is available in 'annotation' directory of the package. Gzipped file is also accepted.
-o <string> Name of output file including read counts. A separate file including summary statistics of counting results is also included in the output ('<string>.summary'). Both files are in tab delimited format.
input_file1 [input_file2] ... A list of SAM or BAM format files. They can be either name or location sorted. If no files provided,                      <stdin> input is expected. Location-sorted paired-end reads are automatically sorted by read names.
-T <int> Number of the threads. 1 by default.
# Parameters specific to paired end reads
-p If specified, libraries are assumed to contain paired-end reads. For any library that contains paired-end reads, the 'countReadPairs' parameter controls if read pairs or reads should be counted.


不可查看链接:马拉松课程https://mp.csdn.net/mp_blog/creation/editor/135890652
jimmy2022https://mp.csdn.net/mp_blog/creation/editor/128177352